customer satisfaction in retail financial services

On the Meaning of Customer Satisfaction: A Study in the Context of Retail Banking

Maarten Terpstra

Printed by: Offsetdrukkerij Ridderprint B.V., Ridderkerk

ISBN/EAN: 978-90-5335-171-0

Copyright: © Maarten Terpstra

Doctoral thesis

submitted for the degree of doctor at Tilburg University, on the authority of the

rector magnificus, Prof. dr. F.A. van der Duyn Schouten, to be defended in public before

a committee appointed by the Doctorate Board, in the auditorium of the

University on Friday 14 November 2008 at 14:15 by

Maarten Jan Terpstra

born on 24 August 1969 in Boxmeer

Supervisors (promotores): Prof. dr. A.A.A. Kuijlen

Prof. dr. K. Sijtsma

Preface

There exists confusion about the meaning of psychological properties. This is because a

psychological property is not a thing within a person, but an organisational principle with

respect to behaviour of persons. This may sound odd, but it means that a psychological

property is a theoretical concept which we use to interpret and describe behaviour of persons.

In recent years I studied the meaning of the psychological property satisfaction. The results

of the study are reported in this thesis.

First of all, I want to express my gratitude to my promotores Ton Kuijlen and Klaas

Sijtsma. They taught me how to do scientific research, they helped and inspired me, and I

have enjoyed our cooperation. I am also grateful to ING for facilitating my study, and to

many colleagues from ING for their support and interest in my results. Furthermore, I thank

Tom Breur for his support and his feedback on the many drafts he read. Most of all, I thank

Monique for her confidence in me finishing my study and for her unconditional support

throughout the years.

Contents

Chapter 1 Introduction

Chapter 2 Measurement of psychological constructs

Chapter 3 The theoretical meaning of customer satisfaction

Chapter 4 Deductive design for test development and construct validation

Chapter 5 Method of the first empirical study into customer satisfaction with BANK

Chapter 6 Results of the first empirical study into customer satisfaction with BANK

Chapter 7 Method of the second empirical study into customer satisfaction with BANK

Chapter 8 Results of the second empirical study into customer satisfaction with BANK

Chapter 9 General discussion

References

Samenvatting (Summary in Dutch)

Appendices

Chapter 1

Introduction

1 Introduction

Satisfaction is an important concept in societal contexts, business contexts, and academic

contexts. This is evidenced by the vast number of studies that were conducted with respect to

satisfaction in various contexts. Ironically, satisfaction seems to be a somewhat elusive

phenomenon. It is as Oliver (1997, p. 13) noted: ‘Everyone knows what satisfaction is, until

asked to give a definition. Then it seems, nobody knows.’ This warrants further research into

the meaning of satisfaction.

The subject of this thesis is the unravelling of the meaning of customer satisfaction in

the context of retail banking. The phrase meaning of customer satisfaction has multiple

connotations. In this thesis, it refers to (a) the linguistic use of the term customer satisfaction,

(b) the theoretical framework of customer satisfaction, (c) the empirical indicators of

customer satisfaction, and (d) the importance of customer satisfaction in the domain of retail

banking. The thesis includes a theoretical study of customer satisfaction and an empirical

study into customer satisfaction with a major Dutch retail bank.

2 A typology of satisfaction studies

Satisfaction was studied in various settings and at various levels of aggregation (e.g., Oliver,

1997, pp. 15-17). This is reflected by the use of different terms, such as job satisfaction, life

satisfaction, consumer satisfaction, customer satisfaction, transaction-specific satisfaction,

attribute satisfaction, service satisfaction, summary satisfaction, and aggregated satisfaction.

The types of satisfaction are mutually related by what Wittgenstein (1953) labeled family

resemblances, meaning that they are mutually related in diverse ways. For example, consumer

satisfaction and customer satisfaction are closely related since both pertain to the satisfaction

response to consumption-related experiences, and these two terms were used more or less

interchangeably in the marketing literature (e.g., Giese & Cote, 2000). However, because

customer satisfaction is only appropriate for satisfaction in commercial contexts and

consumer satisfaction may also be used for satisfaction in other contexts, the domain of

consumer satisfaction is larger than the domain of customer satisfaction.

There are also differences within each type of satisfaction with respect to the

characteristics of the satisfaction response. For example, the consumer satisfaction response

to dinner in a restaurant differs from the consumer satisfaction response to dental treatment.

Whereas the former satisfaction response may encompass a feeling of pleasure, the latter

satisfaction response may encompass a feeling of relief. Furthermore, a consumer satisfaction

response may reflect anhedonic cognitions (Oliver, 1997, p. 318), meaning that it reflects

cognitions that are not emotionally processed. An example is the consumer satisfaction

response to using a pencil.

It is useful to examine the difference between two types of satisfaction studies, which

are (a) studies that are conducted at the individual person level, and (b) studies that are

conducted at higher levels of aggregation. The first type of satisfaction studies is characterised

by analyses of person data. These are, for example, studies of satisfaction of persons with

single encounters with a phenomenon (i.e., transaction-specific satisfaction; Oliver, 1997, p.

15), or studies of satisfaction of persons with the accumulation of encounters with a

phenomenon (i.e., summary satisfaction; Oliver, 1997, p. 15).

The second type of satisfaction studies is conducted at higher levels of aggregation, such

as a firm, an industry, or a society (Oliver, 1997, p. 15). These studies are characterised by the

analysis of satisfaction data that are aggregated at the level of firms, industries, or societies.

For example, several theorists (e.g., Anderson, Fornell, & Lehmann, 1994; Anderson, Fornell,

& Mazvancheryl, 2004; Anderson & Mittal, 2000; Gruca & Rego, 2005) used satisfaction

data at the firm level to study the connections between satisfaction and economic performance

of firms.

Thus, there are different types of satisfaction and different types of satisfaction studies.

The present satisfaction study is conducted at the individual person level, and is limited to

persons’ summary satisfaction with a company of which they are customers. We refer to this

kind of satisfaction as customer satisfaction (see also Chapter 3).

3 Satisfaction research in the marketing domain

Satisfaction is an important concept in marketing theory. Consequently, there is a vast number

of studies into satisfaction in the marketing literature (e.g., Giese & Cote, 2000; Oliver, 1997;

Yi, 1990). Most studies dealt with satisfaction of consumers or customers with products or

services or companies providing products or services. In these studies, satisfaction is often

labeled consumer satisfaction or customer satisfaction, and is often measured by means of a

psychological test that is administered in survey research (Section 4).

Marketing theorists generally agree that satisfaction is a response to consumption-

related experiences (e.g., Anderson, Fornell & Lehmann, 1994; Giese & Cote, 2000; Oliver,

1997; Tse & Wilton, 1988; Yi, 1990). Still, there exists a variety of definitions and measures of

satisfaction in academic marketing research (e.g., Giese & Cote, 2000; Peterson & Wilson,

1992). Furthermore, the term satisfaction is sometimes applied to antecedents and sometimes

to consequences of satisfaction (Oliver, 1997, p. 15). The measurements of these antecedents

and consequences are sometimes used as proxies for satisfaction. Examples of concepts used

as proxies for satisfaction are quality perceptions, recommendation intentions, loyalty,

behaviour, and profits. Although these concepts may serve the purpose of specific studies,

they do not coincide with satisfaction. The use of these concepts as proxies for satisfaction

has contributed to the confusion about the meaning of satisfaction (Oliver, 1997, pp. 15-17).

On the basis of a review of the literature, Giese and Cote (2000) demonstrated a number

of deficiencies in the definition and measurement of satisfaction in studies that were

conducted in the last three decades. These deficiencies pertain to (a) the explication of the

definition of satisfaction, (b) the justification of the definition of satisfaction, and (c) the

justification of the measurement of satisfaction. The deficiencies hampered the development

and validation of satisfaction theory (e.g., Giese & Cote, 2000; Yi, 1990). Giese and Cote

(2000) argued that, as there exist multiple definitions of satisfaction, a researcher must

explicitly define satisfaction and justify the definition selected. Because the complexity and

context-specific nature of satisfaction make it impossible to develop a universal definition of

satisfaction, they recommended the development of context-specific

definitions of satisfaction. This stance implies that measures of satisfaction should also be

context-specific, because the measure should match the definition of satisfaction. Giese and

Cote (2000) proposed a framework to guide researchers in developing a context-specific

definition and a corresponding measurement procedure for satisfaction.

The meaning of satisfaction thus is context-dependent. There are similarities and

differences in the meaning of satisfaction in different domains. Satisfaction with a retail bank

has both similarities and differences with satisfaction with dinner in a restaurant, satisfaction

with consumption of non-durable consumer goods, and satisfaction with consumption of

durable consumer goods. All pertain to the fulfilment response (Oliver, 1997, p. 13), but the

characteristics of the satisfaction response and the nomological network (Cronbach & Meehl,

1955) of satisfaction differ between these domains. These differences warrant the

development of context-specific definitions and corresponding measurement procedures for

satisfaction, as proposed by Giese and Cote (2000). Therefore, the first objective of this study

was to explore the theoretical meaning of customer satisfaction in the context of retail

banking, and to develop a context-specific definition and measurement procedure for

customer satisfaction.

4 Measurement of satisfaction

Satisfaction is a psychological property. Psychological properties are mostly conceived of as

theoretical constructions, which are labeled psychological constructs (e.g., Lord & Novick,

1968, p. 352; Nunnally, 1978, p. 96) and which may be measured by means of psychological

tests and psychological questionnaires (e.g., Molenaar, 1995; Oosterveld, 1996; Schouwstra,

2000). Psychological tests and psychological questionnaires are instruments (e.g., well-chosen

sets of items that are administered in a survey) that are assumed to elicit behaviour (e.g., the

responses of a person to the items administered in the survey) that is representative of the

property of interest. The position of the person on the property is inferred from the response

behaviour of the person (e.g., Molenaar, 1995). In the psychometric literature, the term test

is often used when maximum performance is measured (e.g., as with educational testing and

intelligence testing) and the term questionnaire when typical behaviour is measured (e.g., as

with personality traits and attitudes). Because test has gained a wider use in psychological

measurement (e.g., Cronbach, 1971, p. 443; Murphy & Davidshofer, 1991, p. 8; Schouwstra,

2000, pp. 56-77) we prefer to use it also in this thesis for measurement instruments for typical

behaviour.

Validity of measurement is a key success factor in satisfaction research and in

marketing research in general. This has been broadly acknowledged since the influential papers of

Jacoby (1976), Churchill (1979), and Peter (1981). First, academic studies in this domain

increasingly discuss the convergent, divergent, and nomological validity of measurements of

the constructs of interest. This is in accordance with suggestions by Cronbach and Meehl

(1955), Campbell and Fiske (1959), Churchill (1979), and Peter (1981). Second,

measurements of psychological constructs in academic marketing research are generally

based upon multiple-item instruments. This is in accordance with psychometric theory, which

postulates that single items often yield inadequate measurements of constructs (e.g., Messick,

1989, pp. 14, 35).

The interest in validity of measurement by no means implies that the issues with regard

to validity are resolved. A review of the marketing literature demonstrates a serious problem

regarding the definition and measurement of psychological constructs such as satisfaction

(e.g., Giese & Cote, 2000; Hausknecht, 1990; Peterson & Wilson, 1992; Yi, 1990). For

example, Verhoef (2001, p. 129) noticed that attribute-based measures of satisfaction differ

from affective measures of satisfaction, and that the latter measures of satisfaction have strong

resemblance with measures of affective commitment. Thus, different studies use different

labels for the same construct or use the same label for different constructs, and such

conceptual ambiguities slow down scientific progress.

The practice of validation of measurements of psychological constructs often is not

consistent with theory of validity, and has been criticised by validity theorists. This criticism

includes the practice of validation research in satisfaction studies. The assessment of

convergent, divergent, and nomological validity (Campbell & Fiske, 1959; Churchill, 1979;

Cronbach & Meehl, 1955) does not cover the major threats to construct validity, which are

construct underrepresentation and irrelevant variance (e.g., Messick, 1989, 1995; Schouwstra,

2000). Cronbach (1989) characterised most applications of the multitrait-multimethod design

(Campbell & Fiske, 1959) as mindless and mechanical, involving the collection of facts with

little concern for their usefulness for construct validation. Borsboom, Mellenbergh, and Van

Heerden (2004) criticised the practice of assessing nomological validity, and proposed to

assess validity on the basis of the test of a causal theory regarding the relation between the

property of interest and response behaviour.

Validity theorists (e.g., Anastasi, 1988; Borsboom et al., 2004; Messick, 1989;

Schouwstra, 2000) agree that construct validation has to start at the outset of test

development. This implies that the methodology of validation research should incorporate a

methodology of test development. The second objective of the present study is the selection

of a methodology for the development of a test for customer satisfaction and the validation of

test scores that is in line with validity theory.

5 Importance of satisfaction

Customer satisfaction is expected to influence customer behaviour, customer profitability, and

company profitability (e.g., Anderson & Mittal, 2000; Fornell, 1992; Oliver, 1997).

Therefore, customer satisfaction is considered of strategic importance for companies in many

retail markets, including the Dutch market for retail banking (e.g., Goedee, Reijnders, & Van

Thiel, 2008). During the present study, the Dutch market for retail banking was a mature and

competitive market. Most of the market was divided between six large retail banks. They all

offered a broad range of financial products, including current accounts, saving accounts,

credit cards, loans, mortgages, mutual funds, and insurance products. A number of these products were

also offered by insurance companies and various niche players. Virtually every Dutch adult

owned at least a current account and most owned a variety of financial products. Most of

them had products from different financial companies.

Fornell (1992) argued that customer satisfaction is a key success factor for companies

that operate in mature and competitive markets. In these markets, company growth is

accomplished at the expense of competing firms, and retention of customers is of major

importance for companies in these markets (Fornell, 1992; Reichheld & Sasser, 1990).

Customer satisfaction is considered a key success factor for these companies, because it is

expected to affect retention of customers and to provide a defence against offensive strategies

by their competitors (Fornell & Wernerfelt, 1987, 1988).

Longitudinal studies (e.g., Anderson, Fornell, & Lehmann, 1994; Anderson & Mittal,

2000; Gruca & Rego, 2005) demonstrated a relation between customer satisfaction and future

financial results of companies. The results of these studies strengthen the expectation that

customer satisfaction influences customer profitability. If customer satisfaction influences

customer profitability, there must be a relation between customer satisfaction at time t = 0 and

customer profitability at time t > 0. However, longitudinal studies conducted at the person

level and exploring the relation between customer satisfaction and future customer

profitability are rare in the marketing literature. Therefore, the third objective of this study is

to explore the latter relation on the basis of longitudinal data.
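
As a minimal illustration of what such a person-level relation looks like in data, the sketch below computes a Pearson correlation between satisfaction scores at t = 0 and profitability at a later time. The data, the variable names, and the choice of the Pearson coefficient are assumptions made for illustration only; they are not the data or the analysis of this study, and a correlation by itself does not establish that satisfaction influences profitability.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical person-level data for five customers.
satisfaction_t0 = [17, 9, 19, 13, 6]                    # test scores at t = 0
profitability_t1 = [120.0, 40.0, 150.0, 90.0, 35.0]     # profit at some t > 0

r = pearson_r(satisfaction_t0, profitability_t1)
print(f"r = {r:.2f}")
```

If customer satisfaction influences customer profitability, a coefficient of this kind should differ from zero in longitudinal person-level data; the reverse inference does not hold.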

6 Research goal

Deficiencies in the definition and measurement of satisfaction have hampered the

development and validation of satisfaction theory (e.g., Giese & Cote, 2000; Peterson &

Wilson, 1992; Yi, 1990). The usefulness of satisfaction research for the development of

satisfaction theory may be increased by the resolution of these deficiencies. Because

psychometrics is concerned with the measurement of psychological constructs such as

satisfaction, psychometric methods may serve to overcome these deficiencies. This thesis

aims at contributing to the improvement of the methodology of satisfaction research by the

use of psychometric methods for the definition and measurement of customer satisfaction.

Furthermore, the thesis aims at contributing to the development and validation of satisfaction

theory by means of a study into the meaning of customer satisfaction in the context of retail

banking.

In order to meet the research goal, the thesis addresses four research questions:

1. What is a suitable methodology for test development and construct validation in the

domain of satisfaction research?

2. What is the theoretical meaning of customer satisfaction in the context of retail

banking?

3. What is the empirical meaning of customer satisfaction in the context of retail

banking?

4. What is the importance of customer satisfaction in the context of retail banking?

7 Contents of the thesis

This thesis encompasses three components, which are (a) a theoretical study into the

measurement of psychological constructs and the validity of measurement, (b) a theoretical

study into the meaning of customer satisfaction and customer dissatisfaction, and (c) two

empirical studies into customer satisfaction with a major Dutch retail bank. The empirical

studies were based on survey research that was conducted among customers of the bank.

Chapter 2 addresses the measurement of psychological constructs. The chapter starts

with an introduction into the conception of psychological constructs, the different approaches

to test development, and the measurement process. Subsequently, the theory of validity of

measurement is discussed. The chapter ends with the choice of the appropriate methodology

for test development and construct validation for this study.

Chapter 3 discusses the theoretical meaning of customer satisfaction. The chapter starts

with an exploration of the theory on customer satisfaction and customer dissatisfaction, and

the conceptions of satisfaction and dissatisfaction in these theories. Subsequently, the

nomological network of customer satisfaction is explored. On the basis of these explorations,

a definition of customer satisfaction in the domain of retail banking is provided.

Chapter 4 discusses the deductive design (Schouwstra, 2000). The chapter starts with an

explication of the deductive design, which is a methodology for test development and

construct validation for personality traits and attitude-like properties. Subsequently, the theory

of violators (Oort, 1996), the purpose of the empirical study, the development of the test for

customer satisfaction with a retail bank, the outline of the measurement model, and the

hypotheses regarding the validity of measurement of customer satisfaction are addressed.

The purpose of the first empirical study was to measure customer satisfaction with a

retail bank, to investigate the validity of the measurement of customer satisfaction, and to

explore the relation between customer satisfaction and future customer profitability. Chapter 5

addresses the method of the first empirical study. The chapter includes a discussion of the

measurement instruments that were applied in this study, the questionnaire, the pre-tests, the

pilot study, and the main study.

Chapter 6 presents the results of the first empirical study. The chapter starts with the

discussion of the preliminary data analyses. Subsequently, the measurement analyses and the

tests of the hypotheses are discussed. Next, the relation between customer satisfaction and

future customer profitability is further explored. The chapter concludes with a discussion of

the meaning of the results of the empirical study for the assessment of the validity of

measurement of customer satisfaction.

The purpose of the second empirical study was to test hypotheses regarding the validity

of measurement that were not addressed in the first empirical study. Chapter 7 addresses the

method of the second empirical study. The chapter includes a discussion of the measurement

instruments that were applied in this study, the questionnaire, the sample, and the data

collection.

Chapter 8 presents the results of the second empirical study. The chapter includes a

discussion of the preliminary analyses, the measurement analyses, and the tests of the

remaining hypotheses regarding the validity of the measurements of customer satisfaction.

The chapter concludes with a discussion of the meaning of the results of the study for the

assessment of the validity of the measurements of customer satisfaction.

Chapter 9 is the general discussion. It discusses the results from this study and their

implications for customer satisfaction theory and marketing measurement.

Chapter 2

Measurement of psychological constructs

1 Introduction

A psychological construct such as satisfaction is a theoretical construction with both linguistic

and empirical content. This means that a psychological construct is a term with (a) linguistic

meaning, such as any linguistic term, and (b) relations with empirical phenomena, that is,

observable behaviours. Constructs are highly similar to concepts, and to some extent both

terms may be used interchangeably. Hox (1997, p. 49) noted that both constructs and concepts

are theoretical abstractions, meaning that they represent ideas that are formed by

generalisations from similar phenomena, and that constructs refer to concepts that are more or less

formally defined in scientific theories. Thus, the term concept refers to a somewhat broader

group of theoretical abstractions than the term construct.

The major positions regarding the ontology of psychological constructs are realism

and constructivism (e.g., Borsboom, Mellenbergh & Van Heerden, 2003; Borsboom, 2005,

pp. 6-9). These two positions are discussed in the next section. A third position regarding the

ontology of constructs is operationalism (e.g., Borsboom, Mellenbergh & Van Heerden, 2003;

Borsboom, 2006). Operationalism equates theoretical constructs with their measurements.

Because it is broadly acknowledged that operationalism is untenable (e.g., Borsboom et al.,

2003; Heiser, 2006; Kane, 2006), this position is not discussed in this chapter.

In order to measure a construct one needs (a) to obtain a sample of instances within the

corresponding behavioural domain, (b) to assess whether these instances provide a

representative sample, and (c) to assess how to combine the observations into a measure of

the construct of interest. For the latter purpose one needs to apply statistical models and to

assess the quality of measurement. This is the major subject of psychometrics.
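
Step (c) can be sketched with a small example, assuming hypothetical responses of five persons to four 5-point items. The sketch combines the item responses into a sum score and computes Cronbach's alpha as one familiar index of the quality of measurement. Both the sum score and alpha are chosen here for illustration only; they are not necessarily the statistical model or the quality criterion adopted in this thesis.

```python
from statistics import pvariance


def sum_scores(responses):
    """Combine each person's item responses into one total score."""
    return [sum(person) for person in responses]


def cronbach_alpha(responses):
    """Cronbach's alpha for a persons-by-items table of numeric responses."""
    n_items = len(responses[0])
    # Variance of each item across persons.
    item_variances = [
        pvariance([person[i] for person in responses]) for i in range(n_items)
    ]
    # Variance of the sum scores across persons.
    total_variance = pvariance(sum_scores(responses))
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: rows are persons, columns are items scored 1 to 5.
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
]

scores = sum_scores(responses)
alpha = cronbach_alpha(responses)
print(scores, round(alpha, 2))
```

Whether a simple sum score is an adequate measure depends on the measurement model that is assumed to hold for the items, which is why the choice of a statistical model precedes the interpretation of the scores.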

This chapter addresses the theory on the conception and the measurement of

psychological constructs. The focus of the chapter is on theory regarding the conception and

the measurement of attitude-constructs, such as satisfaction. Theory that is specific for the

conception and the measurement of ability-constructs, such as the various types of specific

intelligence, is not taken into account.

2 Conception of psychological constructs

Scientific concepts are the core of scientific theories (Sartori, 1984; Thomson, 1961). This

implies that psychological constructs are the core of psychological theories. Torgerson (1958,

p. 9) denoted psychological constructs such as satisfaction as property-constructs, and

contrasted them to system-constructs, which are objects and things that possess sets of

particular properties. He argued that, to be of use in scientific theory, a property-construct

must possess both theoretical meaning and empirical meaning (Torgerson, 1958, p. 11).

Whereas theoretical meaning refers to the definition of a construct in terms of theoretical

concepts, empirical meaning refers to the definition of a construct in terms of observable data.

The distinction between theoretical meaning and empirical meaning of constructs is founded

in linguistic and logical positivistic philosophies (e.g., Carnap, 1956; Frege, 1892).

A psychological construct such as intelligence has both theoretical and empirical

meaning. The theoretical meaning of intelligence entails its definition in terms of (a) the

group of attributes or phenomena to which it refers, and (b) its relation with other constructs

in the nomological network. The empirical meaning of intelligence entails the empirical

indicators of the construct and includes, for example, the score on a particular intelligence

test. However, as Torgerson (1958, p. 7) noted, the operationally defined intelligence is not

universally agreed to be the same thing as the theoretically defined intelligence. There is no

identity relation between the theoretical meaning of a construct and its empirical referents.

There is an ongoing debate regarding the ontological status of psychological properties

(e.g., Borsboom, 2005, 2006; Borsboom, Mellenbergh, & Van Heerden, 2003, 2004; Sijtsma,

2006). The major positions regarding the ontological status of psychological properties are

realism and constructivism. The realistic position is founded upon the assumption that

psychological properties exist as unobservable but real entities (e.g., Borsboom, 2005, p. 6).

This means that a property exists independent of its observations, and that the measurement of

a particular psychological property is a reflection of the entity. Borsboom et al. (2003) argued

that measurement of psychological properties requires a realistic position regarding the

particular construct, as the sentences Test X measures the attitude towards nuclear energy and

Attitudes do not exist cannot both be true.

The realistic position regarding the existence of psychological properties raises the

question What is property X, which is supposed to exist? Thus, it seems that the question

regarding the meaning of a particular psychological property precedes the question regarding

the ontological status of this property. The question regarding the meaning of a particular

psychological property is ultimately a linguistic question (Wittgenstein, 1953, 1958). This

means that the property is a term with a meaning that needs to be clarified on the basis of an

examination of the use of the term in linguistic contexts, including psychological theories.

Constructivism does not assume the existence of psychological properties as entities in a

realistic sense. According to constructivism, a psychological property may be conceived of as

an organisational principle with respect to behaviour. Borsboom (2005, pp. 7-9) differentiated

between three constructivist movements, which are logical positivism, instrumentalism, and

social constructivism. These movements have many different characteristics and concerns, but

what they have in common is (a) the differentiation between a theoretical concept and an

empirical concept, and (b) the denial of knowledge about the existence of theoretical concepts

as realistic entities, beyond their existence as organisational principles of behaviour.

Social constructivism deserves special attention because it advocates a linguistic

conception of psychological constructs, meaning that it is the linguistic use of the term that

grants theoretical meaning to the construct (Wittgenstein, 1953, section 43; 1958). This point

of view implies that the justification of a particular construct is founded in the use of the

construct within a particular language context, such as psychological theory. It makes sense to

question what a particular construct refers to, whether it is appropriate in a particular context,

and whether it is useful, but it makes no sense to question whether a particular construct exists

in any physical or physiological sense.

Empirical observations have to demonstrate the use of a construction in a particular

language context. According to Wittgenstein (1953, 1958), the description of particular cases

of a construction will reveal the meaning of the construction. It is fruitless to search for a

sharp definition of a construction like thinking, because cases of thinking are connected to

each other by family resemblances. There is no combination of defining characteristics that

separates all cases of thinking from everything else. A sharp definition will not converge with

the actual use of a construct, because the actual use does not have distinct borders

(Wittgenstein, 1958, p. 44).

The linguistic conception of psychological properties does not deny the existence of

psychological properties in a realistic sense, but it does deny knowledge beyond the

observable. This is best illustrated by the beetle argument (Wittgenstein, 1953, section 293):

‘Suppose everyone had a box with something in it: we call it a beetle – Here it

would be quite possible for everyone to have something different in the box. One

might even imagine such a thing constantly changing – But suppose the word

beetle had a use in these people’s language? – If so it would not be used as the

name of a thing. The thing in the box has no place in the language game at all; not

even as a something: for the box might even be empty. – No, one can divide

through by the thing in the box; it cancels out, whatever it is. – That is to say: if

we construe the grammar of the expression of sensation on the model of ‘object

and designation’ the object drops out of consideration as irrelevant.’

Sartori (1984) provided an important extension of the linguistic conception of social

science concepts. He acknowledged that social science theories are generally expressed in

natural language, which implies fuzzy reasoning, thinking, and operationalisation of concepts,

and that language influences our reasoning and theorising. He argued that science needs a

specialised language, which encompasses the unequivocal definition of its concepts. For this

purpose, Sartori (1984, pp. 31-35) proposed concept analysis. Concept analysis aims at

establishing the meaning of the concept by establishing the scientific definition of the

concept, making sure that the concept is understood unequivocally, and determining the

empirical referents of the concept. The core of concept analysis is the establishing of the

scientific definition of the concept. Sartori (1984, pp. 32-33) proposed to define a concept in

terms of a well-specified set of defining and accompanying characteristics. This is a verbal

definition. Concepts that have different connotations in natural language have to be split,

which results in unequivocally defined concepts. The empirical referents are loosely described

as the real world counterpart of words, which are the objects, entities, or processes denoted by

words. Sartori’s (1984) concept analysis bears resemblance to the explication of constructs

(Carnap, 1950, 1956).

The unequivocal definition of a construct is legitimate and desirable for the

development of scientific theory (Sartori, 1984; Torgerson, 1958). There is ample evidence of

a negative effect of conceptual ambiguities regarding constructs on scientific progress. See Yi

(1990) and Giese and Cote (2000) for discussion on the importance of an unequivocal

conceptualisation of satisfaction for the development of satisfaction theory. Concept analysis

is a useful starting point for research into social science concepts and marketing concepts,

because it may serve to overcome these conceptual ambiguities. However, unequivocal

definitions cannot bridge the gap between theoretical meaning and empirical meaning,

because the meaning of a term differs from the empirical referents (Frege, 1892; Wittgenstein,

1953, 1958). Theoretical constructs exist as linguistic constructions, and they have a surplus

meaning over any empirical meaning.

The constructivist position regarding the ontology of psychological properties is in line

with psychometrics, which is concerned with the modelling of data that reflects behaviour of

persons. This means that the latent trait in a measurement model is estimated from the data,

but it is not the attribute behind the data (e.g., Nunnally, 1978, p. 96, pp. 105-109; Sijtsma,

2006). Lord and Novick (1968, p. 352) explained that psychometrics does not assume the

existence of a property in a physical or physiological sense:

‘…nowhere in psychological theory is there any necessary implication that traits

exist in any physical or physiological sense. It is sufficient that a person behave as

if he/she were in possession of a certain amount of each of a number of relevant

traits and that he/she behaves as if these amounts substantially determined his

behaviour.’

Theory about psychological constructs has to take three points into consideration, which

originate from the conception of psychological constructs as linguistic constructions. First,

psychological constructs are terms that are used in different language contexts, such as

psychological theories. The linguistic use of the term is the first observable, and the analysis

of the use of the term reveals the meaning of the term. Second, psychological constructs may

have empirical referents, which are behaviours interpreted in terms of the construct. The

behaviours are the second observable, and they are the raw material for measurement. Third,

one cannot point to one particular kind of behaviour or one particular set of behaviours that

fully covers a particular construct and nothing else. This means that a particular

psychological construct is connected to a domain of behaviours that cannot be delineated

sharply and cannot be listed exhaustively.

3 Test development

The development of scientific theory requires that its concepts can be measured adequately

(Sartori, 1984; Torgerson, 1958). Psychological constructs can be measured by means of

psychological tests (Chapter 1, Section 4). As a psychological construct is connected to a

domain of behaviours, one can hardly depend on the observation of one instance within a

domain in order to measure the construct. Moreover, Messick (1989) noted that single items

yield moderate measurements of constructs because they almost certainly reflect a

confounding of multiple determinants. Consequently, the measurement of a psychological

construct on the basis of a single item will be biased. This problem is solved with multiple-

item scales, if the different items have different unique components that are mutually

independent.

Scientific research has suggested different methods for the development of

psychological tests. Oosterveld (1996, p. 25) categorised these methods in three approaches

for test development, which are the deductive approach, the intuitive approach, and the

inductive approach.

The methods of the deductive approach are based upon explicit theory about the

construct of interest. This theory is the basis of the formulation of a definition of the construct

and eventually the content of the items and the composition of the test (Oosterveld, 1996, p.

The methods of the intuitive approach are based upon implicit knowledge and implicit

hypotheses regarding the construct of interest. There is no theory regarding the construct of

interest that grounds the formulation of a definition of the construct and eventually the content

of the items and the composition of the test.

The methods of the inductive approach are exploratory. A test is developed on the basis

of observable relations between either the items or the items and some criterion. The methods

may be characterised as data driven, which means that the analysis of the available data

makes up the core of test development.

On the basis of empirical research into the quality of different methods, Oosterveld

(1996, p. 127) concluded that the deductive approach to test construction yields better tests

than the intuitive and inductive approaches. This means that the methods of the deductive

approach yielded tests that provided test scores having better validity and reliability than the

methods of the other approaches. Oosterveld (1996) studied two methods of the deductive

approach, which were the construct method (Jackson, 1971, 1973) and the facet design

method (Guttman, 1954). The methods can be described in terms of four components, which

are (a) the conception of the construct, (b) scale development, (c) scale construction, and (d)

evaluation of scale scores (Oosterveld, 1996, p. 24).

The construct method (Oosterveld, 1996, pp. 16-20) is a theory-oriented method. The

first step of the method is the definition of the construct on the basis of scientific theory

regarding the construct. The definition of the construct in terms of phenomena and attributes

that it refers to is called the explicit definition, and the definition of the construct in terms of

its relation with other constructs in the nomological network is called the implicit definition

(Schouwstra, 2000, p. 61). The second step of the method is elaboration or scale development.

This step includes item specification, item production, and item judgement. The items need to

be content saturated. This means that each item should correlate relatively high with the scale

score that represents the concept the item is expected to measure, and relatively low with

scale scores representing other concepts (Oosterveld, 1996, p. 19). Thus, each item must

possess convergent and divergent validity. The third step of the method is scale construction,

which refers to the application of a measurement model to the empirical data aimed at

producing a scale on which persons can be measured with respect to the concept of interest.

The fourth step of the method is the evaluation of the scale scores. This step includes, for

example, the assessment of reliability and construct validity of scale scores. It may be noted

that the construct method bears resemblance to Churchill’s (1979) procedure for test

development in marketing research.
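
The requirement of content saturation can be illustrated with a small computation. The sketch below is a minimal illustration in Python, not Oosterveld's (1996) procedure: the scores are invented, the function names are hypothetical, and the use of a rest score (the sum of the other items of the same scale) as the own-scale score is an assumption made here to avoid correlating an item with itself.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def content_saturation(item, own_scale_items, other_scale_items):
    """Correlate an item with its own rest score and with another scale score."""
    own_rest_score = [sum(scores) for scores in zip(*own_scale_items)]
    other_scale_score = [sum(scores) for scores in zip(*other_scale_items)]
    return pearson(item, own_rest_score), pearson(item, other_scale_score)


# Invented scores of five persons (assumption, for illustration only).
item = [1, 2, 3, 4, 5]
own_scale_items = [[1, 2, 2, 4, 5], [2, 2, 3, 5, 5]]    # other items, same scale
other_scale_items = [[3, 1, 4, 2, 3], [2, 5, 1, 3, 3]]  # items of another scale

r_own, r_other = content_saturation(item, own_scale_items, other_scale_items)
# A content-saturated item shows r_own clearly higher than r_other.
print(round(r_own, 2), round(r_other, 2))
```

In this invented example the item correlates highly with its own rest score and weakly with the other scale score, which is the pattern that content saturation asks for. What counts as relatively high or relatively low remains a judgement for the researcher.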

Guttman (1954; see also Hox, 1997) introduced the facet design. The facet design

defines a universe of observations by classifying them with a scheme of facets (i.e., variables)

that contain different elements (i.e., values). Facet theory distinguishes three types of facets,

which are (a) population facets, which classify the population, (b) content facets, which

classify the concept, and (c) response facets, which classify the behaviours. Each of these

facets has one or more distinct values that are called the elements of the facet. The product of

all elements of all facets defines the universe of observations.
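
The way in which a facet design spans a universe of observations can be sketched in a few lines of Python. The facets and elements below are hypothetical and serve only as an illustration; they are not taken from Guttman (1954), nor from the facet design used later in this thesis.

```python
from itertools import product

# Hypothetical facets and elements (assumptions, for illustration only).
facets = {
    "population": ["private customer", "business customer"],
    "content": ["service", "product", "price"],
    "response": ["rating on a five-point scale"],
}


def universe_of_observations(facets):
    """Return every combination of one element per facet (Cartesian product)."""
    names = list(facets)
    return [dict(zip(names, combination)) for combination in product(*facets.values())]


universe = universe_of_observations(facets)
print(len(universe))  # 2 * 3 * 1 = 6 combinations
```

Each entry of `universe` combines one element from every facet. Adding an element to any facet enlarges the universe multiplicatively, which is why a facet design makes the scope of a study explicit.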

The facet design method (Oosterveld, 1996, pp. 20-24; Stouthard, Mellenbergh &

Hoogstraten, 1993) is a method for test development that is aimed at the optimisation of

content validity by means of a systematic representation of the concept. In the first step, the concept is

represented on the basis of the combination of one or more content facets. Each content facet

has one or more elements, and a particular combination of elements of each content facet is

called a structuple (Oosterveld, 1996, p. 22). The product of all elements of all content facets

defines the set of structuples and delineates the concept (see, e.g., Section 4 of Chapter 4).

The second step of the method is elaboration or scale development. This step includes item

specification, item production, and item judgement. The items have to be derived from the

facet structure. Each item must be specific for a single structuple of the facet structure. The

third step of the method is scale construction. Scale construction refers to the analysis of the

data by means of a measurement model, aimed at producing the measurement scales and the

scale scores. The fourth step is the evaluation of the scale scores. This step includes, for

example, the assessment of reliability and construct validity of scale scores.

Both the construct method and the facet design method incorporate some kind of

concept analysis that clarifies the meaning of the construct of interest and facilitates its

definition. In the case of the facet design method, this analysis should facilitate a definition of

the construct in the format of a facet design, and in the case of the construct method this

analysis should facilitate an explicit and an implicit definition of the construct. However, it is

not immediately clear what kind of concept analysis reveals the meaning of the construct

and facilitates its definition. Wittgenstein (1953, 1958, p. 44) argued that it is the examination

of examples of the use of a term in language contexts that reveals the meaning of the term.

Following this argumentation, it is appropriate to examine the use of the term in various

language contexts, including scientific theories, in order to clarify the meaning of the

construct and to develop a research definition of the construct. In practice, this requires an

inventory of diverse studies into the construct, and the examination of the conception of

the construct in these studies. See Giese and Cote (2000) for an example of this practice in

consumer satisfaction research; that is, the examination of definitions of consumer satisfaction

in scientific research, the analysis of similarities and differences between these definitions,

and the introduction of a framework for the development of context-specific definitions of

consumer satisfaction.

4 Measurement process

Coombs (1964, p. 4) represented the process of psychological measurement in a scheme

(Figure 1). The observations Coombs (1964) referred to are observations of behaviour, and

the data are psychological data. In phase one of the process, the researcher has to decide on

the collection of observations. The universe of observations is theoretically unlimited, and it is

up to the researcher to choose and to record particular observations from a particular research

population. In phase two, the researcher transforms the observations into data. It always takes

some decision or action on the part of the researcher to create the data on the basis of his/her

observations. Therefore, Coombs (1964, pp. 3-6, 29) conceived of data as interpretations of

observations by the researcher. In phase three, the researcher applies a measurement model to

the data in order to construct one or more scales, and to classify the stimuli and/or the persons.

A scale represents a property, and the classification of stimuli and/or persons on a scale

constitutes the measurement of a property. Thus, it is properties of stimuli and/or persons

that are measured, and it is stimuli and/or persons that are classified (Torgerson, 1958, p.

[Figure 1. Scheme: universe of potential observations → (Phase 1) recorded observations → (Phase 2) data → (Phase 3) inferential classification of individuals and stimuli.]

Figure 1: The Measurement Process (Coombs, 1964, p. 4)

Figure 1 illustrates that the scaling analyses are not at the core but at the end of the

measurement process. Coombs (1964, p. 5) argued that the phases preceding the scaling

analyses are at least as important components of the measurement process. Furthermore, the

scheme illustrates that each phase encompasses one or more decisions made by the researcher,

which influence the output of the phase concerned and the measurements. For example, the

researcher may code the answers to some closed question as nominal data, ordinal data, or

numerical data, and use a suitable measurement model to analyse the data. The coding of the

responses and the choice of the measurement model are based upon assumptions made by the

researcher with respect to the observations that he or she made. For this reason, Coombs

(1964, p. 5) noted that ‘psychological data and measurements and scales are theory’.

Psychometrics suggests different measurement models that may be applied in the last

phase of the measurement process. The major types of measurement models are the classical

test theory (CTT) model (Lord & Novick, 1968), the item response theory (IRT) models (e.g.,

Embretson & Reise, 2000) and the factor analytic models (e.g., Bollen, 1989; Gorsuch, 1983).

It is noteworthy that different measurement models may yield different scales of the property,

which means that they may yield different classifications of persons. The choice of a

researcher for a particular measurement model may be based on the hypothesised relationship

between the data and the property, the desired level of measurement, and the intention to test

hypotheses about the fit of the model.

The quality of measurement is not self-evident but has to be demonstrated. The major

criteria with respect to quality of measurement are the fit of the measurement model, the

reliability of the scale scores, the generalisability of conclusions, and the validity of the

interpretation of the scale scores (Molenaar, 1995).

The first criterion is the fit of the measurement model. The measurement model is a

formal representation of the expected data structure. The fit of the model refers to the extent

to which the theoretical assumptions of the model regarding the structure of the data match

the empirical data. This is, for example, the extent to which the theoretical correlation matrix

that is based upon the scale scores is in agreement with the empirical correlation matrix, or the

extent to which a theoretical assumption such as unidimensionality is in agreement with the

dimensionality of the empirical data. A major advantage of IRT models such as the Mokken

model (Mokken, 1971) and the Rasch model (Rasch, 1960) is the availability of powerful

tests of the fit of the model to the data (Molenaar, 1995). Since these models imply testable

statements regarding the structure of the data, their fit can be falsified on the basis of

empirical data.

The second criterion is reliability, which refers to the accuracy of scale scores. The

reliability coefficient originated from CTT, and is defined as the ratio of the true score

variance and the observed score variance in the population of interest. Neither the true scores,

which are defined as the observed scores minus the measurement errors, nor the true score

variance can be observed. Therefore, the reliability coefficient has to be estimated by other

means, such as the internal consistency coefficient, which is known as coefficient alpha

(Cronbach, 1951). The reliability coefficient is generally used to obtain the standard error of

measurement in scale scores. The standard error of measurement is used to estimate a

confidence interval for a person’s true score, and can be used for testing hypotheses about the

true score. For example, it can be tested whether two scale scores, which serve as estimates of

the true scores, are different, or whether a scale score is significantly different from a cut

score.
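
The chain from coefficient alpha to the standard error of measurement and a test against a cut score can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the item scores are invented, the function names are hypothetical, and the interval is the usual normal approximation around the observed scale score (z = 1.96) rather than a procedure prescribed by the sources cited above.

```python
from math import sqrt
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha for a persons-by-items table of scores."""
    k = len(scores[0])
    item_variances = [pvariance([person[i] for person in scores]) for i in range(k)]
    total_variance = pvariance([sum(person) for person in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def standard_error_of_measurement(scale_scores, reliability):
    """SEM = standard deviation of the scale scores * sqrt(1 - reliability)."""
    return sqrt(pvariance(scale_scores)) * sqrt(1 - reliability)


def differs_from_cut_score(scale_score, cut_score, sem, z=1.96):
    """True if the cut score lies outside the interval scale_score +/- z * SEM."""
    return abs(scale_score - cut_score) > z * sem


# Invented scores of five persons on three items (assumption).
scores = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3], [1, 2, 2]]
scale_scores = [sum(person) for person in scores]

alpha = cronbach_alpha(scores)
sem = standard_error_of_measurement(scale_scores, alpha)
print(round(alpha, 3), round(sem, 3))
```

With these invented data, a scale score of 13 differs from a hypothetical cut score of 10, whereas a scale score of 9 does not, because the latter difference falls within 1.96 standard errors of measurement.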

In IRT, an item response function is defined for each item in the test. For a particular

item, the item response function defines the probability of a particular score given the

person’s measurement value on the scale of interest. Thus, persons with different

measurement values have different probabilities of providing a particular score. An example

is an item response function that defines the probability of a correct answer to a particular

arithmetic item as an increasing function of arithmetic ability. Persons having higher

arithmetic ability levels have higher probabilities of giving the correct answer. The use of

item response functions implies that the magnitude of the measurement error depends on the

person’s location on the scale. Thus, one person may be measured with greater accuracy using

a particular item and a particular test than another person who has another scale location

(Molenaar, 1995).
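
The dependence of measurement accuracy on scale location can be made concrete with the Rasch model mentioned above. The sketch below is a minimal illustration in Python: the item difficulties are invented, the function names are hypothetical, and the standard error is taken as the inverse square root of the summed item information, which holds for the Rasch model with dichotomous items.

```python
from math import exp, sqrt


def rasch_probability(theta, difficulty):
    """Probability of a correct answer given scale location theta."""
    return 1.0 / (1.0 + exp(-(theta - difficulty)))


def scale_information(theta, difficulties):
    """Sum of item informations p * (1 - p) at scale location theta."""
    probabilities = [rasch_probability(theta, b) for b in difficulties]
    return sum(p * (1 - p) for p in probabilities)


def standard_error(theta, difficulties):
    """Standard error of the estimated scale location at theta."""
    return 1.0 / sqrt(scale_information(theta, difficulties))


# Invented item difficulties of a three-item test (assumption).
difficulties = [-1.0, 0.0, 1.0]

for theta in (0.0, 3.0):
    print(theta, round(standard_error(theta, difficulties), 2))
```

A person located near the item difficulties (theta = 0) is measured with a smaller standard error than a person located far above them (theta = 3), because the items carry little information at that location.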

The third criterion is generalisability, which refers to the extent to which conclusions

from measurement analyses are generalisable over various conditions. To assess the

generalisability of conclusions, one has to study the sources of randomness in measurement

(Molenaar, 1995). Major sources of randomness are (a) the sampling of persons, (b) the

sampling of items, (c) the test conditions, and (d) the mode of administration of the test. For

example, due to differences in test conditions (e.g., Messick, 1989, p. 81) a set of items may

constitute a scale in one empirical study but not in another empirical study. This necessitates

the assessment of the fit of the measurement model in different empirical studies in which the

measurement instrument is used. Furthermore, the mode of administration may influence the

responses to test items. For example, results obtained via telephone interviews cannot be

compared with results obtained from on-line interviews without having investigated the

comparability of these modes of data collection (e.g., Bronner & Kuijlen, 2007). It is

recommended to reflect on the plausible sources of randomness in advance of a study and, if

necessary, to test empirically whether particular generalisations are justified (Molenaar,

1995).

The fourth criterion is validity. Messick (1989, p. 13) defined validity as ‘an integrated

evaluative judgement of the degree to which empirical evidence and theoretical rationales

support the adequacy and appropriateness of inferences and actions based on test scores or

other modes of assessment’. This definition entails validity of measurement (i.e., the validity

of test-score interpretations for describing a person; Cronbach, 1971, pp. 445-449) and

validity for decision-making (i.e., the validity of test-score interpretations for making

decisions about a person; Cronbach, 1971, pp. 445-449). Validity is extensively discussed in

the next section.

5 Validity

The concept of validity has evolved throughout time (e.g., Anastasi, 1986; Angoff, 1988;

Schouwstra, 2000). Initially, validity was conceived of as the degree to which a test measures

what it purports to measure (Kelley, 1927). Validity was demonstrated on the basis of the

correlation of test scores with some criterion, which is called criterion-related validity (e.g.,

Anastasi, 1988, p. 145; Cronbach & Meehl, 1955). However, it proved to be difficult to find

objective criteria for different kinds of measurements, such as measurements of different

psychological constructs. This problem gave rise to new methods for establishing validity and

eventually to different conceptualisations of validity, such as (a) criterion-related validity, (b)

content validity, and (c) construct validity (Cronbach & Meehl, 1955).

Content validity is established by showing that the behaviours sampled by the test are a

representative sample of the domain of interest (e.g., Anastasi, 1988, p. 140; Cronbach, 1971,

p. 451; Messick, 1989, pp. 39-42; Murphy & Davidshofer, 1991, pp. 107-109). As such,

content validity pertains to evidence about the domain coverage and the degree to which the

content of the test represents the domain. In order to establish content validity, one must

depart from an elaborated definition of the construct of interest. This definition should include

a detailed description of what the construct refers to, and of what the construct does not refer

to but may be related to (Schouwstra, 2000). Content validity is then established on the basis

of the comparison of the structure of the test with the specified structure of the construct.

Thus, content validity is a property of tests rather than of test-score interpretations (Messick,

1989, p. 17).

Two additional remarks are in order. First, content validity has to be incorporated at the

onset of test development. For example, Messick (1989, p. 39) noted that, on the basis of the

construct definition, a researcher can develop a test which covers all aspects or facets of the

construct of interest according to a specified rule such as equal coverage, which means that all

aspects or structuples are equally represented in the test. This is the core of content validity.

Second, content validity should not be confused with face validity. The latter pertains to

whether the test looks valid to test users, and not to what the test scores actually reflect

(Anastasi, 1988, p. 144). Therefore, validity theorists do not consider face validity as a

conceptualisation of validity.

Cronbach and Meehl (1955) conceived of construct validity as the appropriateness of

test-score interpretations. They discussed construct validation, and they concluded that

construct validation may include many investigations, such as research into content validity,

criterion-related validity, inter-item correlations, and inter-test correlations. Furthermore, they

proposed defining a construct by means of a network of associations or propositions in which

the construct of interest occupies a central position. This network is the nomological network.

The study of relations between test scores and measurements of concepts in the nomological

network provides evidence pro or contra construct validity. Construct validation requires the

integration of all evidence into a judgement of construct validity. Because this judgement is

qualitative by nature, it cannot be expressed as a single coefficient, such as the reliability of

test scores (Cronbach & Meehl, 1955).

One additional remark is in order. Cronbach and Meehl (1955) explained construct

validation, and their explanation illustrates that they conceived of construct validity as the

appropriateness of test-score interpretations (see also Cronbach, 1971, p. 447). However, they

did not provide an explicit definition of construct validity. The lack of an explicit definition

may have contributed to confusion about the meaning of construct validity. For example,

Churchill (1979) conceived of construct validity as a property of a test, which does not match

the conception of construct validity as the appropriateness of test-score interpretations.

Churchill (1979) and Peter (1981) introduced construct validity in the marketing

literature. The work of these authors has guided validation research in academic marketing

research up to the present day. Elaborating on the work of Cronbach and Meehl (1955) and

Campbell and Fiske (1959), they split construct validity into (a) nomological validity, (b)

divergent validity, and (c) convergent validity. Nomological validity refers to the relationships

between the test scores and measures purported to assess different but related concepts.

Discriminant or divergent validity refers to the extent to which test scores differ from

measures of other concepts that are expected to be different from the concept of interest in

theoretically interesting ways. Convergent validity refers to the extent to which test scores

correlate with other measurements of the same construct.

Churchill (1979) and Peter (1981) proposed multitrait-multimethod (MTMM) research

(Campbell & Fiske, 1959) to investigate construct validity. MTMM research requires

measurements of at least two traits by at least two methods, so that each trait is measured by

each method. The MTMM matrix consists of the correlations between (a) the same trait

measured by means of different methods, (b) different traits measured by means of the same

method, and (c) different traits measured by means of different methods. Convergent validity

is assessed on the basis of inspection of the first set of correlations, divergent validity is

assessed on the basis of inspection of the second set of correlations, and method bias is

assessed on the basis of a comparison of the second and the third set of correlations.
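
The three sets of correlations in an MTMM matrix can be sketched as follows. This is a minimal illustration in Python and not the procedure of Campbell and Fiske (1959): the traits, methods, and scores are invented, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mtmm_sets(measures):
    """Group all pairwise correlations; measures maps (trait, method) to scores."""
    sets = {
        "same trait, different methods": {},
        "different traits, same method": {},
        "different traits, different methods": {},
    }
    for (key_a, x), (key_b, y) in combinations(measures.items(), 2):
        same_trait = key_a[0] == key_b[0]
        same_method = key_a[1] == key_b[1]
        if same_trait and not same_method:
            label = "same trait, different methods"
        elif same_method and not same_trait:
            label = "different traits, same method"
        else:
            label = "different traits, different methods"
        sets[label][(key_a, key_b)] = pearson(x, y)
    return sets


# Invented scores of six persons: two traits, each measured by two methods.
measures = {
    ("satisfaction", "telephone"): [1, 2, 3, 4, 5, 6],
    ("satisfaction", "online"): [1, 3, 2, 4, 6, 5],
    ("loyalty", "telephone"): [3, 1, 4, 2, 6, 5],
    ("loyalty", "online"): [2, 1, 4, 3, 5, 6],
}

correlation_sets = mtmm_sets(measures)
for label, correlations in correlation_sets.items():
    print(label, [round(r, 2) for r in correlations.values()])
```

With two traits and two methods, each set contains two correlations. The first set is inspected for convergent validity, the second for divergent validity, and the second and third sets are compared to assess method bias.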

Belson (1986) explicitly addressed the subject of validity in survey research. The

measurement of psychological constructs is typically based upon survey research. Thus, the

quality of the survey data delineates the validity of measurements of psychological constructs.

Belson (1986) noted that the accuracy of answers to survey questions cannot be taken for

granted because misinterpretations of questions, memory decay of participants, and

unwillingness to respond may contaminate the data. Ample evidence exists of the effects of

questionnaire format, questionnaire length, and the wording of questions and response

categories on the responses of participants to questions (e.g., Belson, 1981; Bradburn, 1983;

Rugg, 1941; Schuman & Presser, 1981; Sudman & Bradburn, 1982; Saris et al., 1998;

Scherpenzeel, 1995). For these reasons, Belson (1986) proposed assessing validity in survey

research on the basis of an investigation of the quality of the answers given to survey

questions. This includes the investigation of the quality of opinion data. Belson (1986)

proposed various techniques to assess the validity of survey research, such as (a) the

evaluation of the data collection procedure in terms of the known principles of question

formulation and questionnaire design, (b) the pre-testing of the questions, and (c) the

execution of a pilot of the questionnaire.

Messick’s (1989, p. 13) definition of validity is important for various reasons. First, the

definition expresses unequivocally that the subject of validation is the interpretation and the

use of test scores. This is in agreement with the practice of validation in psychological

research, which is to investigate the meaning of test scores in a specific context and the

usefulness of test scores for various decision-making purposes (e.g., Anastasi, 1988;

Cronbach, 1971; Murphy & Davidshofer, 1991). Second, the definition expresses that

different lines of evidence have to be considered when making a judgement of validity. This

includes evidence of criterion-related validity, content validity, and the original conception of

construct validity (Cronbach & Meehl, 1955). Third, the definition expresses that these

different lines of evidence cannot be integrated into a single coefficient, but have to be

integrated into a judgement regarding the test-score interpretation (e.g., Cronbach, 1971, p.

464; Cronbach & Meehl, 1955; Messick, 1989, 1995). This judgement has a gradual nature

(Messick, 1989, p. 13), which implies that the test-score interpretations may have high

validity, moderate validity, low validity, or no validity at all. Fourth, the definition expresses

that validation is an unending process that includes the judgement of evidence gathered in the

processes of test development and test use (Anastasi, 1986, 1988; Cronbach, 1971, p. 452;

Messick, 1989, p. 13). Fifth, Messick (1989, pp. 20-21) differentiated between the assessment

of construct validity and the assessment of the consequences of the use and the interpretation

of test scores as the two bases of validity. In that context, Messick (1989, pp. 20-21, 34; 1995)

argued that construct validity comprises the rationales and evidence supporting the

trustworthiness of test-score interpretations in terms of the construct of interest, and that the

validation of decision-making practices of test scores comprises the appraisal of social

consequences of the use and interpretation of test scores.

Messick (1989, pp. 34-35, 1995; see also Cook & Campbell, 1979) addressed two

general threats to construct validity, which are construct underrepresentation and irrelevant

variance. Construct underrepresentation refers to the risk of measuring only a part of the

construct of interest, such as only the cognitive aspect of customer satisfaction instead of both

the cognitive and affective aspects of the construct (e.g., Oliver, 1997, p. 343). Irrelevant

variance refers to the risk of measuring more than just the construct of interest, such as other

traits, concepts related to specific group membership, or response tendencies. Both construct

underrepresentation and irrelevant variance refute the interpretation of test scores in terms of a

reflection of the construct of interest and nothing else. It may be noted that the common

practice of collecting empirical evidence for a network of associations between measurements

does not exclude the two threats to construct validity. When a relationship is found between

the measure of the attribute and other attributes, the test score may still reflect only part of the

attribute. Also, the test score may reflect something more than just the attribute of interest.

Messick’s (1989, 1995) conception of construct validity as a property of test-score

interpretations is today’s dominant conception of construct validity in psychometrics.

However, it does not provide a clear-cut methodology for investigating construct validity. For

this purpose, Schouwstra (2000, pp. 58-59) proposed the deductive design, which is a

methodology for construct validation and for the development of tests for typical behaviour,

such as behaviour related to satisfaction. The deductive design is consistent with Messick’s

conception of construct validity. Schouwstra’s methodology encompasses the collection of

theoretical and empirical evidence regarding the interpretation of test scores in terms of the

construct of interest, and nothing else. As such, it takes the two global threats to construct

validity into account, which are construct underrepresentation and construct irrelevant

variance.

Borsboom et al. (2004) criticised Messick’s (1989, p. 13) definition and conception of

validity. They agreed with Kelley (1927) that a test is valid if it measures what it purports to

measure, and they defined validity of tests accordingly: ‘A test is valid for measuring an

attribute if (a) the attribute exists in the real world, and (b) variations in the attribute causally

produce variations in the outcomes of measurement procedures’. Thus, Borsboom et al.

(2003, 2004) defined validity as a property of tests, and they took a realist stance regarding

the nature of psychological constructs. They opposed Cronbach and Meehl (1955) and

Messick (1989, 1995), who conceived of construct validity as a property of test-score

interpretations, and conceived of psychological constructs as postulated attributes of people.

Two additional remarks are in order. First, Borsboom et al. (2004) argued that the

conception of validity as a property of tests has direct relevance for validation research.

Evidence of validity should be based upon research into the response process, that is, the

relation between the attribute and response behaviour. The research should test a hypothesis

with respect to the processes that lead to measurement outcomes. This amounts to a test of a

causal theory about the relation between attribute and response behaviour. Because a

nomological network is not a theory of the causal relation between attribute and test score, the

authors considered the nomological network irrelevant for validation research. Thus, in their

view validation research should not assess the relationship of the construct with other

constructs in the nomological network, but test a causal theory about the processes that evoke

behaviour.

Second, Borsboom et al. (2004) argued that the conception of validity as a property of

tests has direct relevance for test construction. A large part of test validity research has to be

done at the stage of test construction. Test development should depart from a theory on the

causal relation between the attribute and behaviour. This approach to test development has

been applied successfully with respect to measurement of some specific ability constructs,

such as transitive reasoning (Bouwmeester & Sijtsma, 2006) and cognitive development

(Jansen & Van der Maas, 1997).

6 Discussion

There is no broad consensus on either the conception of validity or the methodology of

validation research. This is due partly to different conceptions of validity being based upon

different conceptions of psychological constructs, and partly to validity theory that is still

developing and has not yet come to a conclusion. We discerned three perspectives on validity

and validation research that are important for current academic research: (a) the Churchill

perspective, (b) the Messick perspective, and (c) the Borsboom perspective. These

perspectives are presented in Table 1.

The Churchill perspective on construct validity is the leading perspective on construct

validity in academic marketing research. It was introduced in Churchill’s (1979) procedure for

test development in marketing research. Peter (1981) and Fornell and Larcker (1981) further

elaborated Churchill’s perspective and the associated methods for validation research.

Churchill’s procedure for test development in marketing research has contributed

markedly to the measurement of psychological constructs in the corresponding domain (e.g.,

Bearden, Netemeyer, & Mobley, 1993), but Churchill’s perspective on construct validity is

not in line with modern theories of construct validity. The criteria associated with Churchill’s

perspective do not address the two global threats to construct validity, which are construct

underrepresentation and construct irrelevant variance (Messick, 1989, 1995; Schouwstra,

2000). Consequently, the methods associated with this perspective do not suffice for the

assessment of construct validity.

Table 1: Three Perspectives on Validity and Validation Research

                        Churchill perspective       Messick perspective         Borsboom perspective

Theoretical foundation  Constructivism              Constructivism              Realism

Conception              Construct validity is a     Construct validity is a     Validity is a property
                        property of tests           property of test-score      of tests
                                                    interpretations

Criteria                Convergent validity         Quality of construct        Test of causal theory
                        Divergent validity          representation
                        Nomological validity        Absence of irrelevant
                                                    variance

Prototypical design     MTMM design                 Deductive design            Experimental design
                        Correlation with criterion

Outcome                 Gradual judgement of        Gradual judgement of        Binary judgement of
                        validity                    validity                    validity

First, content validity receives insufficient attention in Churchill’s perspective on

construct validity. Moreover, content validity was confused with face validity (e.g., Churchill,

1979; Bearden et al., 1993, p. 3). This may be considered the major flaw of the Churchill

perspective, because face validity only provides intuition for a particular interpretation of

what the test measures. Instead, empirical evidence is needed to support construct validity.

Such evidence comes from the investigation of the fit of the measurement model, the

plausible sources of measurement bias, and the nomological network of the construct. The

investigation of a test’s content validity adds to the process of construct validation in that it

provides evidence on whether the item set used in the test is representative of the hypothetical

domain of items used to operationalise the attribute (e.g., Messick, 1989, pp. 36-42).

Second, the practice of MTMM research does not generate strong evidence of construct

validity. This is partly due to the fact that MTMM research is not concerned with content

validity, and partly due to the lack of guidance on how to choose appropriate traits and

methods in MTMM studies. For obtaining strong evidence of construct validity, it is

necessary that the traits chosen are clearly similar and that the methods chosen are clearly

different. For example, Anastasi (1988, p. 158) argued that the agreement between two

measures of the same trait that are obtained by maximally similar methods reflects reliability,

and that the agreement between two measures of the same trait that are obtained by maximally

different methods reflects validity. In general, the methods applied in MTMM studies are

quite similar (e.g., Byrne, 1989; Churchill & Surprenant, 1982; Fornell & Larcker, 1981; Saris

et al., 1998; Scherpenzeel, 1995; Wirtz & Lee, 2003). As a consequence, the agreement

between different measures of the same trait mostly reflects reliability rather than validity.
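
The argument above can be made concrete with a minimal sketch that names the block of an MTMM matrix to which a correlation between two measures belongs. The block names follow the conventional multitrait-multimethod terminology; the class name, the function name, and the trait and method labels are hypothetical and serve only as an illustration. Note that the sketch treats methods as either identical or different, whereas the argument in the text turns on the degree of similarity between methods.

```python
from typing import NamedTuple


class Measure(NamedTuple):
    """One measure in an MTMM design (labels are hypothetical)."""
    trait: str   # e.g. "satisfaction"
    method: str  # e.g. "rating scale"


def classify_mtmm_cell(a: Measure, b: Measure) -> str:
    """Name the MTMM block that the correlation between a and b belongs to."""
    same_trait = a.trait == b.trait
    same_method = a.method == b.method
    if same_trait and same_method:
        # Same trait, same method: agreement is evidence of reliability.
        return "monotrait-monomethod (reliability)"
    if same_trait:
        # Same trait, different methods: agreement is evidence of validity,
        # provided that the methods are clearly different.
        return "monotrait-heteromethod (convergent validity)"
    if same_method:
        # Different traits, same method: agreement points to method effects.
        return "heterotrait-monomethod (shared method variance)"
    return "heterotrait-heteromethod"
```

For example, classify_mtmm_cell(Measure("satisfaction", "rating scale"), Measure("satisfaction", "interview")) returns the monotrait-heteromethod label. If the two methods are in fact nearly identical, such a correlation is better read as evidence of reliability than as evidence of validity.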

The Messick perspective is the leading perspective on construct validity in psychology.

In this perspective, construct validity is conceived of as a property of test-score interpretations

(i.e., the appropriateness of the interpretations of test scores in terms of the construct of

interest; this is also labeled validity of measurement, validity of test-score interpretations, and

construct validity of test scores). The best argument in favour of this conception of construct

validity is that a test may yield valid measurements of the construct of interest in one context,

and invalid measurements of the construct in another context. Moreover, a particular

interpretation of a test score may be valid while another interpretation is invalid. Therefore it

is the test-score interpretation that needs to be validated, and not the test.

The Messick perspective matches the constructivist position regarding the ontology of

psychological constructs. This is a major virtue of the Messick perspective. Another major

virtue is that it can be put into action by the deductive design (Schouwstra, 2000). The

deductive design provides a methodology for validation research that addresses the two global

threats to construct validity, and that is in line with Messick’s conception of construct

validity. Also, the deductive design incorporates the rationales behind test development. This

is in agreement with the notion stipulating that construct validation starts with the process of

test development. For these reasons, we subscribe to Messick’s perspective on construct

validity and Schouwstra’s methodology for validation research.

The Borsboom perspective is important for several reasons. First, it advocates a theory-

driven approach to construct validation. Borsboom et al. (2004) rightly argued that construct

representation is at the core of validity, and that proof of construct representation is founded

in theory regarding the construct of interest. Second, Borsboom et al. (2004) demonstrated the

limited usefulness of investigating convergent, divergent, and nomological validity. They

rightly argued that the investigation of these types of validity is subordinate to other evidence

regarding construct representation, such as theory testing. Third, Borsboom et al. (2004)

recommended that one explicates and tests theories of response behaviour. This is a useful

suggestion, because there is ample evidence of the disturbing influence of method

characteristics on response behaviour (e.g., Belson, 1981; Belson, 1986; Bradburn, 1983;

Rugg, 1941; Schuman & Presser, 1981; Sudman & Bradburn, 1982; Saris et al. 1998;

Scherpenzeel, 1995). Fourth, Borsboom et al. (2004) criticised Messick’s (1989, p. 13)

definition of validity as a judgement instead of a property. We subscribe to that criticism

where it concerns construct validity, but not where it concerns validity for decision-making

uses of test scores.

Borsboom et al. (2004) subscribed to Kelley’s (1927) view that a test is valid if it measures

what it purports to measure. Thus, the Borsboom perspective is characterised by the

conception of validity as a property of tests. This conception of validity is problematic,

because whether a test measures what it purports to measure does not depend exclusively on

the content of the test. It also depends on, for example, the administration of the test, the

population in which the test is used, and eventually on the research goal. Thus, a particular

test may measure what it purports to measure in one instance but not in another instance.

Consequently, validity has to be assessed with each administration of a test, and this justifies a

conception of validity as a property of test-score interpretations.

The major weakness of the Borsboom perspective is its foundation in a realist

conception of properties, which causes three problems. The first problem pertains to the

meaning of the statement ‘Property X exists’. According to realism, the statement expresses that

property X exists as an entity, independent of the observations (Borsboom, 2005, p. 6). We

consider this interpretation inappropriate, because properties are organisational principles

through which we perceive and interpret the world. Some of these organisational principles

are useful because they have many empirical referents. An example is aggression. Other

organisational principles are less useful because they have few if any empirical referents. An

example is clairvoyance. Thus, we contend that the statement ‘Property X exists’ expresses that

property X exists as an organisational principle. The second problem pertains to the statement

that variations in the property cause variations in the outcomes of measurement procedures.

This statement cannot be tested because one cannot observe covariation between an

unobservable entity and its measurement. Thus, one cannot know whether this statement is

true. The third problem pertains to the definition of properties. Borsboom’s perspective

requires a well-specified theory on the relationship between the property and response

behaviour. The theory should specify the set of responses for each level of the property, how

responses vary if levels of the property vary, and which response patterns exist and which do not.

This amounts to a definition of the property in terms of response patterns, but that cannot be

the meaning of the property.

The Borsboom perspective may suit abilities, such as transitive reasoning, for which the

meaning is close to its operationalisation. However, for psychological attributes such as

satisfaction the Messick perspective is to be preferred, because it is founded on a

constructivist position regarding the ontology of psychological properties.

7 Conclusions

1. A psychological construct is a theoretical concept with theoretical and empirical

meaning. There is, however, no identity relation between the theoretical meaning and

the empirical meaning. This means that a construct has a surplus meaning over its

empirical indicators.

2. The theoretical meaning of a construct is linguistic by nature. It is the linguistic use of a

construct that grants meaning to the construct, and it is the examination of the linguistic

use that demonstrates the theoretical meaning of the construct. This means that the

theoretical meaning of a construct should be studied by means of an examination of

various examples of the linguistic use of that construct.

3. The theoretical meaning of a construct encompasses (a) the group of attributes and

phenomena the construct refers to, and (b) the relation of the construct with other

constructs in a nomological network. The former component is expressed in the explicit

definition of the construct and the latter component in the implicit definition of the

construct (Schouwstra, 2000, p. 61).

4. The empirical meaning of a construct embraces a domain of behaviours that cannot be

delineated sharply and cannot be listed exhaustively. Nevertheless, the construct has to

be measured on the basis of different observations from this behavioural domain. The

sampling of these observations constitutes the first phase of the measurement process

(Coombs, 1964, p. 4).

5. The development and validation of psychological theory requires measurements of

constructs that are in line with their theoretical meaning. This supports a deductive

approach to test development, which means that the development of the test is based

upon a formal definition of the construct of interest.

6. The Messick perspective on construct validity corresponds best with the linguistic

conception of psychological constructs. In this perspective, construct validity is the

appropriateness of test-score interpretations in terms of the construct of interest.

7. The deductive design exemplifies how to validate measurements according to Messick’s

perspective. For this reason, we chose the methodology of the deductive design for test

development and construct validation in the empirical study (Chapter 4 onwards).

Chapter 3

The theoretical meaning of customer satisfaction

1 Introduction

In Chapter 2, we concluded that the theoretical meaning of a construct is inherently linguistic,

and that it is the linguistic use of the term that grants meaning to the construct (Wittgenstein,

1953). For this reason, the theoretical meaning of customer satisfaction has to be clarified by

means of an examination of the linguistic use of the term. This is the examination of examples

of the linguistic use of the term in scientific studies as well as its use in everyday language.

In the present chapter, the theoretical meaning of customer satisfaction is investigated.

The investigation encompasses an examination of (a) conceptions of satisfaction, (b)

conceptions of dissatisfaction, (c) theories of satisfaction, (d) concepts in the nomological

network of satisfaction, and (e) measures of satisfaction in the marketing literature. Based on

the results of the investigation, the term customer satisfaction is explained and defined. The

explicit definition of customer satisfaction addresses the group of attributes and phenomena

that customer satisfaction refers to, and the implicit definition of customer satisfaction

addresses the connections of customer satisfaction with other concepts in a nomological

network.

2 Conceptions of satisfaction

A review of the marketing literature by Yi (1990) and Giese and Cote (2000) yielded a

multitude of definitions of consumer satisfaction, customer satisfaction, summary satisfaction,

and transaction-specific satisfaction. The different definitions of these terms reflect different

conceptions of satisfaction. In order to clarify the theoretical meaning of satisfaction, we

examined the major conceptions and the corresponding definitions of satisfaction in the

marketing literature.

The marketing literature distinguishes two important conceptions of satisfaction. The

first is satisfaction as a response to disconfirmation (Table 1, first column) and the second is

satisfaction as a valenced response to consumption (Table 1, second column). Both

conceptions can be applied to transaction-specific satisfaction (Oliver, 1997, p. 15), which

concerns satisfaction with single encounters with the focal object (Table 1, first row), and to

summary satisfaction (Oliver, 1997, p. 15), which concerns satisfaction with the accumulation

of encounters with the focal object (Table 1, second row). Each cell in Table 1 is associated

with several definitions of satisfaction, as they can be found in the marketing literature (e.g.,

Giese & Cote, 2000; Yi, 1990). Because the subject of this thesis is summary satisfaction with

a bank, we discuss both satisfaction as a response to disconfirmation and satisfaction as a

valenced response to consumption, and also the prototypical definitions of summary

satisfaction associated with each of the two conceptions of satisfaction.

Table 1: Conceptions of Satisfaction in the Marketing Literature

                                Response to disconfirmation        Valenced response to consumption

Based on a single encounter     Transaction-specific satisfaction  Transaction-specific satisfaction
with focal object

Based on accumulation of        Summary satisfaction               Summary satisfaction
encounters with focal object

Satisfaction as a response to disconfirmation

Disconfirmation refers to the perceived discrepancy between pre-consumption expectations

and post-consumption perceptions. The conception of satisfaction as a response to

disconfirmation originated from disconfirmation theory (e.g., Oliver, 1980, 1997). According

to disconfirmation theory, the level of satisfaction (and also dissatisfaction) is a function of

pre-consumption expectations and disconfirmation of expectations. Whereas positive

disconfirmation of expectations contributes to satisfaction, negative disconfirmation of

expectations contributes to dissatisfaction. In the augmented disconfirmation theory, the level

of satisfaction is also a function of the perceptions of outcomes of consumption (Oliver, 1997,

pp. 119-121). The augmented disconfirmation model is represented in Figure 1.
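
The structure of the augmented model can be illustrated with a minimal sketch. It rests on three assumptions that are not taken from the source: disconfirmation is treated as a simple difference score, satisfaction is treated as a weighted sum of its three determinants, and the default weights are arbitrary hypothetical values. Disconfirmation theory itself only states that satisfaction is a function of these determinants.

```python
def disconfirmation(expectation: float, perception: float) -> float:
    """Perceived discrepancy (assumed here to be a simple difference score).

    Positive when perceptions exceed expectations (positive disconfirmation),
    negative when perceptions fall short (negative disconfirmation).
    """
    return perception - expectation


def satisfaction_score(
    expectation: float,
    perception: float,
    w_expectation: float = 0.2,      # hypothetical weight
    w_disconfirmation: float = 0.5,  # hypothetical weight
    w_perception: float = 0.3,       # hypothetical weight
) -> float:
    """Illustrative linear version of the augmented disconfirmation model."""
    d = disconfirmation(expectation, perception)
    return (
        w_expectation * expectation
        + w_disconfirmation * d
        + w_perception * perception
    )
```

With equal expectations, a consumer whose perceptions exceed expectations, as in satisfaction_score(3, 5), obtains a higher score than a consumer whose perceptions fall short, as in satisfaction_score(3, 1). Setting w_perception to zero reduces the sketch to the original model, in which satisfaction depends on expectations and disconfirmation only.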

Disconfirmation theory is the dominant satisfaction theory, and was investigated in

several studies (e.g., Churchill & Surprenant, 1982; De Ruyter, Bloemer, & Peeters, 1997;

Oliver, 1980; Oliver, 1997; Oliver & Burke, 1999; Oliver & DeSarbo, 1988; Tse & Wilton,

1988; Van Montfort, Masurel, & Van Rijn, 2000). Although these studies are not unanimous

with respect to the magnitude of the effects of expectations, perceptions, and disconfirmation

on satisfaction, there is evidence of the significance of each of these effects (e.g., Oliver,

1997; Oliver & Burke, 1999).

Figure 1: The augmented disconfirmation model of satisfaction (diagram with the elements Expectations, Perceptions, Disconfirmation, and (Dis)satisfaction)

The disconfirmation model has met with three important problems. The first problem

pertains to the use of pre-consumption expectations as the comparison standard for the

consumer’s post-consumption perceptions. Alternatives for this comparison standard are (a)

the ideals held by the consumer, (b) the needs of the consumer, and (c) standards concerning

fairness held by the consumer (Oliver, 1997, pp. 71-72, 133-134). Thus, there is no broad

consensus about the conception of disconfirmation. The second problem pertains to the

operationalisation of expectations. If one cannot get access to consumers before consumption

takes place, it is not possible to measure pre-consumption expectations, and instead one can

only measure retrospective expectations at best. Because expectations may change during the

process of consumption, retrospective expectations may differ from the pre-consumption

expectations held by the consumer. The third and major problem pertains to the conception of

satisfaction as a response to disconfirmation (e.g., Bloemer, 1993, p. 93; Oliver, 1980; Tse &

Wilton, 1988). This conception disregards the content of the satisfaction response, which

should be the core of the explicit definition of the concept (e.g., Oliver, 1997, p. 13; Sartori,

1984, pp. 32-33; Schouwstra, 2000, p. 61).

The definitions of satisfaction associated with this theory define satisfaction in terms of

a response to disconfirmation. For example, Tse and Wilton (1988; also see Table 2) defined

consumer satisfaction/dissatisfaction as ‘the consumer’s response to the evaluation of the

perceived discrepancy between prior expectations (or some other norm of performance) and

the actual performance of the product as perceived after its consumption’. Bloemer (1993, p.

61; also see Table 2) defined satisfaction as the ‘outcome of the subjective evaluation that the

chosen alternative (the brand) meets or exceeds the expectations of the person’. It may be

noted that the subjective evaluation is the perceived discrepancy between prior expectations

and actual performance of the brand, and that the subjective evaluation results from the

processing of expectations and performance of the brand. Bloemer (1993, p. 93; also, see

Bloemer & Kasper, 1995; Bloemer & Poiesz, 1989) argued that the extent to which persons

process expectations and performances depends on both the motivation and the ability of the

person to do so. For this reason, she differentiated between latent satisfaction, which results

from a low degree of processing of expectations and performances, and manifest satisfaction,

which results from a high degree of processing of expectations and performances. Because

this differentiation is an elaboration of the conception of satisfaction, it is an important

extension of disconfirmation theory.

Satisfaction as a valenced response to consumption

The conception of satisfaction as a valenced response to consumption concerns the

satisfaction response to consumption experiences, and is therefore typical of consumer

satisfaction and customer satisfaction. Oliver (1997, p. 28) explained valence as ‘polarity, the

positivity or negativity of a state of nature’. Thus, a valenced response can be placed on a

dimension that ranges from negative to positive. A special case of the valenced response is the

neutral response. A neutral response to consumption is given when a person is neither

satisfied nor dissatisfied with his or her consumption experience. It may be noted that in the

conception of satisfaction as a valenced response to consumption, the satisfaction response is

distinguished from non-valenced responses (e.g., the propositions ‘it is dark’ and ‘2+2=4’), and

from valenced responses towards things that were not consumed (e.g., a person’s judgement of a

car that he or she never drove).

The prototypical definitions associated with this conception of satisfaction are the

definitions provided by Howard and Sheth (1969), Fornell (1992), Oliver (1997), and Giese

and Cote (2000). There are important differences between these definitions. Howard and

Sheth (1969; also see Table 2) defined satisfaction as ‘the buyer’s cognitive state of being

adequately or inadequately rewarded for the sacrifices he or she has undergone’. This is the

prototypical definition of satisfaction as a cognition. Fornell (1992; also see Anderson,

Fornell, & Lehmann, 1994; also see Table 2) defined customer satisfaction as ‘an overall

post-purchase evaluation’. This definition was only applied with respect to summary

satisfaction, and it was the basis of several national customer satisfaction indices (Fornell,

1992; Johnson, Gustafsson, Andreassen, Lervik, & Cha, 2001).

Oliver (1997, p. 13) defined consumer satisfaction as ‘the judgement that a product or a

service feature, or the product or service itself, provided or is providing a pleasurable level of

consumption-related fulfilment, including levels of under- or overfulfilment’. This definition

requires an explanation. First, the definition expresses that satisfaction is a response to

fulfilment, which implies that it is evoked during or after consumption. Second, the term

judgement in the definition expresses that the satisfaction response is a valenced response.

Third, the term fulfilment in the definition expresses that a goal exists, that something needs to

be fulfilled. Fourth, the term pleasurable in the definition expresses that satisfaction includes

affects. This notion is in line with the results from recent studies into the nature of satisfaction

responses (e.g., Friman, 2004; Giese & Cote, 2000; Van Dolen, Lemmink, Mattsson, &

Rhoen, 2001; Wirtz & Lee, 2003).

Oliver (1997, pp. 318-319) noted that satisfaction responses may become manifest as an

affect (a pleasant or an unpleasant feeling), a cognition (a positive or a negative judgement),

or both. Whether the satisfaction response is manifested as an affect, a cognition, or both

depends on the person, the focal object, and the context. For example, satisfaction with the

postal services may become manifest in the form of cognitions, and satisfaction with dinner in

a restaurant may become manifest in the form of affects. Consequently, Oliver (1997, pp.

318-319) distanced himself from the view of satisfaction as anhedonic cognition. He

concluded that affect coexists alongside cognitive judgements in producing the satisfaction

response. This means that satisfaction may be manifested in affects as well as in cognitions.

Oliver (1997) demonstrated that satisfaction may arise from different processes, such as

performance evaluations, processing of expectations, disconfirmation of expectations, need

fulfilment, equity evaluations, cognitive dissonance, and processing of affects. Therefore he

concluded that satisfaction may become manifest in various responses. Oliver (1997, pp. 337-

342) suggested differentiating between four prototypical satisfaction responses, which he

labeled satisfaction-as-contentment, satisfaction-as-pleasure, satisfaction-as-delight and

satisfaction-as-relief. In some contexts, satisfaction may be manifested as the absence of

dissatisfaction (Giese & Cote, 2000; Westbrook & Oliver, 1991). In survey research in the

automotive industry, Westbrook and Oliver (1991) demonstrated that a large proportion of

consumers were rather unemotional about their car. In general, these consumers responded

positively to satisfaction items, and negatively to dissatisfaction items. The authors argued

that in this consumer segment, satisfaction might be interpreted as the absence of

dissatisfaction. This implies, for example, that consumers remain satisfied until problems

occur that hamper consumption. According to Oliver (1997, p. 340), absence of

dissatisfaction is a special case of satisfaction-as-contentment.

Oliver (1997, p. 339) described the contentment satisfaction state as a passive response

to consumption that results when satisfaction states are maintained or prolonged. Contentment

satisfaction or latent satisfaction (Bloemer, 1993) appears to be a common meaning of

satisfaction in contexts that are characterised by stable consumption outcomes, such as the

consumption of postal services or of a long-lasting consumer durable. According to Oliver

(1997, p. 340), if a survey focuses on satisfaction in an ongoing-use situation, most persons

will be responding from a satisfaction-as-contentment state, and fewer persons will be

responding from a satisfaction-as-delight, satisfaction-as-pleasure, or satisfaction-as-relief

state.

Giese and Cote (2000) defined consumer satisfaction as ‘(a) an affective response of

varying intensity, (b) directed towards focal aspects of the acquisition and/or consumption of

products or services, and (c) determined at the time of purchase or temporal points during

consumption, and lasting for a finite but variable amount of time’. This is the prototypical

definition of satisfaction as an affect. Qualitative research in a sample of 158 persons (Giese

& Cote, 2000) demonstrated that 60 to 70 percent of the participants explained the term

satisfaction in terms of affect. This is an important result because it demonstrates the affective

content of satisfaction. Giese and Cote (2000) concluded that consumer satisfaction is an

affective response of a consumer towards some phenomenon. They argued that cognitions

may be at the basis of the formation of consumer satisfaction, but that these cognitions do not

constitute consumer satisfaction.

Giese and Cote also argued that the meaning of satisfaction is context-specific. There

are many contextual variables that affect how satisfaction is perceived, and these variables

differ over domains in reality. For example, satisfaction with a retail bank differs from

satisfaction with medical care or satisfaction with a sports car. Persons have different needs

and different expectations in different contexts, and these differences influence the meaning

of satisfaction in these contexts. Therefore, Giese and Cote (2000) concluded that the

definition and the measurement of satisfaction also are context-specific. They proposed a

framework for developing context-specific definitions of consumer satisfaction. In line with

their definition, the framework addresses three components of the definition of satisfaction.

These components are (a) the type of affective response, (b) the timing of the response, and

(c) the focus of the response. The framework should facilitate the development of context-

specific definitions of satisfaction and corresponding measurement procedures.

3 Conceptions of dissatisfaction

A major issue in satisfaction research, including satisfaction research in the marketing

domain, is the conception of dissatisfaction. The literature provides two stances regarding the

conception of dissatisfaction (Giese & Cote, 2000). Dissatisfaction is either considered to be

the opposite of satisfaction on a bipolar dimension (the one-factor theory; Figure 2) or

satisfaction and dissatisfaction are viewed as two different dimensions (the two-factor theory;

Figure 2). The latter stance postulates that an individual can be simultaneously satisfied and

dissatisfied with a focal object (Yi, 1990). This means, for example, that one can be

simultaneously satisfied and dissatisfied with one’s car if the car is reliable but

does not accelerate well.

According to the one-factor theory, dissatisfaction is the opposite of satisfaction on a

bipolar dimension. This stance is reflected in, for example, Oliver’s (1997, p. 28) definition of

dissatisfaction as ‘the negative satisfaction state, when the consumer’s level of fulfilment is

unpleasant’. Thus, he considers dissatisfaction to be the opposite of satisfaction on a bipolar

dimension. It is noteworthy that the conception of dissatisfaction as the opposite of

satisfaction does not rule out the possibility that a consumer is satisfied with one aspect of

consumption outcomes and dissatisfied with another aspect. However, it does rule out the

possibility that a consumer is both satisfied and dissatisfied with one phenomenon at one

point in time.

According to the two-factor theory (Herzberg, Mausner, & Snyderman, 1959)

satisfaction and dissatisfaction have different antecedents, and should be conceived of as

independent dimensions. The notion that satisfaction and dissatisfaction have different

antecedents results from research into phenomena that caused satisfaction responses and

phenomena that caused dissatisfaction responses (e.g., Herzberg et al., 1959; Johnston, 1995).

For example, Johnston (1995) reported that the phenomenon of helpfulness of a bank was a

determinant of satisfaction with a bank, and that the phenomenon of integrity of a bank was a

determinant of dissatisfaction with a bank. Similarly, Herzberg et al. (1959, pp. 72-74)

reported that the phenomenon of responsibility was a determinant of satisfaction with a job,

and the phenomenon of salary was a determinant of dissatisfaction with a job. The

phenomena that are expected to cause satisfaction responses are often labeled motivator

factors or motivators, and the phenomena that are expected to cause dissatisfaction are often

labeled hygiene factors or hygienes (e.g., Oliver, 1997, pp. 146-150; Wolf, 1970).

[Figure 2 (diagram). One-factor theory: dissatisfaction is the opposite of satisfaction on a bipolar dimension, with poles Satisfied and Dissatisfied. Two-factor theory: satisfaction and dissatisfaction are unipolar constructs, one ranging from Not satisfied to Satisfied and one from Not dissatisfied to Dissatisfied.]

Figure 2: Conceptions of satisfaction and dissatisfaction in the one-factor theory and the two-factor theory, respectively

The two-factor theory is disputable because empirical research demonstrated that a

phenomenon (e.g., magnitude of responsibility) can be a source of both satisfaction and

dissatisfaction (e.g., job satisfaction and job dissatisfaction; for an overview of empirical

studies into the two-factor theory, see Wolf, 1970; see also Oliver, 1997, pp. 146–150). For

example, Soliman (1970) studied satisfaction and dissatisfaction of persons with their jobs,

and found that satisfaction and dissatisfaction were the opposite ends of a continuum.

Furthermore, Soliman (1970) found that when needs of a person were provided for

adequately, motivators were more important for satisfaction/dissatisfaction than hygienes, and

when needs of a person were provided for moderately, motivators and hygienes were equally

important for satisfaction/dissatisfaction. Eventually, Soliman (1970) concluded that the

effects of motivators and hygienes on satisfaction/dissatisfaction were dependent upon the

level of need fulfilment which was already accomplished. On the basis of a review of various

research findings, Wolf (1970) reached a similar conclusion.

Generalising the results of Soliman (1970) and Wolf (1970) implies, for example, that a

person’s satisfaction/dissatisfaction with his or her car depends on the level of need fulfilment

which was already accomplished. Assuming that the acceleration power of a car is a motivator

factor and that the reliability of a car is a hygiene factor, acceleration power of one’s car is

more important for satisfaction/dissatisfaction when the needs of a person are provided for

adequately, and reliability of one’s car is more important for satisfaction/dissatisfaction when

the needs of a person are provided for badly.

Russell and Carroll (1999a) investigated whether positive affect at some point in time is

the opposite of negative affect at that same point in time, or whether positive affect is

independent of negative affect. They defined a bipolar model of momentary affect, deduced

the theoretical correlations between positive affect measures and negative affect measures,

and compared these theoretical correlations with the empirical correlations observed in

various empirical studies (for an overview, see Russell & Carroll, 1999a). The authors

concluded that when controlling for the major factors that influence the correlation between

positive affect and negative affect, which are measurement error, item selection, and response

format, there was no basis for rejection of the bipolarity hypothesis. The more sources of bias

against bipolarity were removed the closer the data matched the bipolar model. Consequently,

Russell and Carroll (1999a, 1999b) concluded that the empirical evidence supports the

bipolarity hypothesis of momentary affect. It is plausible that this conclusion can be

generalised to satisfaction, and that dissatisfaction should be conceived of as the opposite of

satisfaction on a bipolar dimension. This is consistent with the dominant causal theory of

satisfaction, which is disconfirmation theory (e.g., Oliver, 1997; Tse & Wilton, 1988).
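
The role of measurement error, one of the three factors mentioned above, can be illustrated with the classical attenuation relation from test theory: the expected observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The sketch below is a textbook illustration and not the model of Russell and Carroll (1999a); the function name and the reliability values are assumptions chosen for the example.

```python
import math


def attenuated_correlation(true_r: float, reliability_x: float, reliability_y: float) -> float:
    """Expected observed correlation under classical test theory.

    true_r: correlation between the error-free scores.
    reliability_x, reliability_y: reliabilities of the two measures, in (0, 1].
    """
    if not -1.0 <= true_r <= 1.0:
        raise ValueError("true_r must lie in [-1, 1]")
    if not (0.0 < reliability_x <= 1.0 and 0.0 < reliability_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return true_r * math.sqrt(reliability_x * reliability_y)


# Hypothetical case: perfectly bipolar true scores (r = -1) measured with
# two scales that each have an assumed reliability of 0.7.
observed_r = attenuated_correlation(-1.0, 0.7, 0.7)
print(round(observed_r, 3))
```

Under these assumed reliabilities the observed correlation is clearly weaker than -1 even though the underlying scores are perfectly bipolar, which shows why an observed correlation short of -1 is not, by itself, evidence against bipolarity.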

Generalising the results of Russell and Carroll (1999a, 1999b) to satisfaction and

dissatisfaction, a person’s simultaneous satisfaction with the reliability of his or her car and

dissatisfaction with its acceleration power does not imply that satisfaction and dissatisfaction

have to be considered two different dimensions. It implies that satisfaction/dissatisfaction is

assessed with respect to different attributes of the car and that, with respect to each attribute,

satisfaction is the opposite of dissatisfaction on a bipolar dimension. Thus, satisfaction with a

focal object can be conceived of as the opposite of dissatisfaction with the same focal object

(Oliver, 1997, p. 28).

4 The dual-process model of satisfaction and dissatisfaction

Oliver (1997) proposed a model that describes how both a satisfaction response and a

dissatisfaction response may result from different psychological processes. This model is

denoted as the dual-process model (Oliver, 1997, p. 317), because it addresses two kinds of

processes, appraisal and non-appraisal of affects and cognitions, which may evoke a

satisfaction response. The satisfaction response may be manifested in the form of (a)

unappraised affects, (b) appraised affects, (c) unappraised cognitions, and (d) appraised

cognitions. Oliver conceived of unappraised affects and unappraised cognitions as the

immediate affects and the immediate cognitions that follow upon the experience of the focal

object. Appraised affects and appraised cognitions refer to affects and cognitions that have

been elaborated more intensively.

Satisfaction responses as unappraised affect refer to the immediate pleasure or the

immediate displeasure caused by consumption experiences. For example, an unappraised

affect is the immediate pleasure caused by smoking a cigarette. Satisfaction responses as

appraised affects result from the elaborations upon these affects. These elaborations include

the attribution of affects to a particular cause, and the evaluation of the value of the affect for

the individual. For example, the immediate reaction to smoking a cigarette may be the

experience of satisfaction and feelings of comfort, but the cognitive elaboration upon smoking

may yield feelings of doubt and eventually dissatisfaction. Unappraised cognitions are factual

cognitions regarding consumption outcomes, which are not further processed and do not raise

affects. The processes evoking unappraised cognitions account for the manifestation of

satisfaction as anhedonic cognitions; for example, noticing that one’s car functions well

without experiencing any feelings whatsoever (e.g., Oliver, 1997, p. 318; Westbrook &

Oliver, 1991). Satisfaction responses as appraised cognitions result from elaborations of

cognitions resulting from consumption experiences, such as the satisfaction responses that

result from disconfirmation of expectations. For example, contrary to expectation one’s car

may not function well. The disconfirmation may evoke feelings of displeasure and eventually

dissatisfaction. The dual-process model is represented in Figure 3. It may be noted that

affects, cognitions, and satisfaction are psychological properties, and that consumption and

appraisal are activities.

The dual-process model accounts for different manifestations of satisfaction. First, the

process evoking unappraised affects accounts for the manifestation of satisfaction as an

affective response to consumption experiences. The conception of satisfaction as unappraised

affect is a special case of the manifestation of the satisfaction response according to the

definition of satisfaction by Giese and Cote (2000), which also includes affective appraisals of

cognitions. Second, the process evoking appraised affects accounts for the manifestation of

satisfaction as an overall evaluation. This manifestation of the satisfaction response may be

interpreted as a special case of the definition of satisfaction by Fornell (1992), which seems to

be focused primarily on the cognitive evaluation of consumption experiences without

explicitly distinguishing immediate cognitions and elaborations of cognitions, but far less on

affects. Third, the process evoking unappraised cognitions accounts for the manifestation of

satisfaction as anhedonic cognitions (e.g., Oliver, 1997, p. 318; Westbrook & Oliver, 1991).

Fourth, the process evoking appraised cognitions accounts for the manifestation of

satisfaction as a response to cognitions, such as the affective response to disconfirmation.

This manifestation of the satisfaction response is consistent with the definition of satisfaction

given by Giese and Cote (2000).

[Figure 3 (diagram) with the elements Consumption, Affects, Cognitions, Appraisal, and (Dis)satisfaction.]

Figure 3: Dual-process model of satisfaction and dissatisfaction

The dual-process model is in agreement with the conception of satisfaction as a

valenced response to consumption experiences, and with Oliver’s (1997, p. 13) definition of

satisfaction. Therefore, the dual-process model constitutes an important contribution to

satisfaction theory. However, two remarks are in order. First, according to the dual-process

model appraisal is either present or absent. This may be a simplification of reality, because

appraisal may be represented by a continuum ranging from absence of appraisal to presence

of appraisal. Second, the dual-process model does not express the conditions under which

appraisal is present or absent. Therefore, further research is needed to elaborate the model.

5 Concepts in the nomological network of customer satisfaction with a retail bank

This section addresses the nomological network of customer satisfaction in the context of

retail banking (Figure 4). The nomological network of a concept is the network of

associations of a concept with other concepts. The nomological network with respect to

satisfaction that is relevant in this study includes the concepts of trust, quality, loyalty, and

profitability. This nomological network is shown in Figure 4. The four concepts are (a)

considered important in the financial services industry, and (b) expected to be related to

customer satisfaction in this industry. According to many theorists (e.g., Hennig-Thurau,

Gwinner, & Gremler, 2002; Luo & Homburg, 2007; Oliver, 1997; Verhoef, 2001; Yi, 1990),

customer satisfaction is also related to concepts such as word-of-mouth, image, commitment,

marketing communication, retention, and cross-sell. Each of these concepts may be further

split up into part concepts. For example, image may be split up into corporate associations,

corporate image, and corporate reputation (e.g., Berens, 2004), and commitment may be split

up into affective commitment and calculative commitment (e.g., Verhoef, 2001). These

additional concepts were ignored in this study, because (a) trust, quality, loyalty, and

profitability were considered of primary importance to satisfaction research in the context of

retail banking, (b) inclusion of all concepts would introduce redundancy, such as the inclusion

of both loyalty (primary importance) and commitment (alternative concept), and (c) it was

anticipated that the measurement of all concepts in a survey would produce a questionnaire

that would be too long and demand too much time and effort of the participants of this study. Even

though one might argue that the alternative concepts also have a place in the nomological

network of satisfaction, we decided to leave them out to maintain a simple model tailored to

the practice of this study (Chapter 4 onwards).

First, the relationship between trust and customer satisfaction is discussed. Trust is

considered to be of major importance in retail banking, and has been shown to be related to

customer satisfaction (e.g., Hennig-Thurau et al., 2002; Singh & Sirdeshmukh, 2000; Verhoef,

2001). Trust is often seen as an antecedent of satisfaction (but for an exception, see Singh &

Sirdeshmukh, 2000); thus, in Figure 4 an arrow runs from trust to satisfaction.

[Figure 4 (diagram): arrows run from trust and from quality to satisfaction, and from satisfaction to loyalty and to profitability.]

Figure 4: Nomological network of satisfaction in the context of retail banking

Second, the relationship between quality and customer satisfaction is addressed. Quality

of products and services is considered to be of major importance in retail banking, and has

been shown to be related to customer satisfaction (e.g., Anderson et al., 1994; Cronin &

Taylor, 1992; Zeithaml & Bitner, 1996). Like trust, quality is often conceived of as an

antecedent of satisfaction but there seems to be more agreement among theorists with respect

to quality; thus, in Figure 4 the arrow runs from quality to satisfaction.

Third, the relationship between customer satisfaction and customer loyalty is addressed.

The relationship between these constructs has been demonstrated in various studies (e.g.,

Caruana, 2002; Oliver, 1999), and customer satisfaction is often conceived of as a necessary

although not a sufficient condition for customer loyalty (e.g., Gremler & Brown, 1996; Oliver,

1999). Therefore, in Figure 4 the arrow runs from satisfaction to loyalty.

Fourth, the relationship between customer satisfaction and customer profitability is

discussed. Longitudinal studies by Anderson et al. (1994), Anderson and Mittal (2004), and

Gruca and Rego (2005) have investigated the relationship between customer satisfaction and

future financial performance of companies. The results of these studies strengthen the

expectation that customer satisfaction influences customer profitability. In Figure 4, the arrow

pointing toward customer profitability shows the influence of customer satisfaction on

customer profitability.

Conceptions of trust

A review of the marketing literature yields two important conceptions of trust. The

expectations-conception of trust focuses on a person’s expectations with respect to an

exchange partner, while the behavioural-conception focuses on a person’s behavioural

intentions with respect to an exchange partner (Singh & Sirdeshmukh, 2000). An example of

an expectation is that a customer expects to be treated fairly by the bank, and an example of a

behavioural intention is the customer’s intention to continue the relationship with the bank or

even expand the relationship, for example, by buying new products such as an insurance or a

mortgage in addition to a bank account. The major difference between these conceptions is

that the expectations-conception of trust does not include behavioural intentions in the domain

of trust, while the behavioural-conception of trust does.

Morgan and Hunt (1994) conceived of trust as existing when one party has confidence

in an exchange partner’s reliability and integrity. This is an expectations-conception of trust,

which is based upon Rotter (1967), who defined trust as a generalised expectancy held by an

individual that the word of another individual or a group can be relied upon. Following

Morgan and Hunt (1994), we defined trust as a person’s confidence in the reliability and

integrity of the company. This is a common definition of trust in the marketing literature (e.g.,

Verhoef, 2001, p.18), which we also adopt in this study (also, see Chapter 5).

Singh and Sirdeshmukh (2000) conceived of trust as a continuum that is bounded on one

side by a high level of trust and on the other side by a high level of distrust. The trust state and

the distrust state differ with respect to the valence of the expectations held by the person. It

may be noted that some authors suggested distinguishing between different dimensions of

trust, such as competence-trust and benevolence-trust (e.g., Singh & Sirdeshmukh, 2000), or

benevolence-trust and honesty-trust (e.g., Medlin & Quester, 2002). This stance implies that

each dimension of trust is bounded by a high level of trust on the one side and by a high level

of distrust on the other side. However, the dimensionality of trust is an empirical question,

and studies establishing the dimensionality of trust are rare (Singh & Sirdeshmukh, 2000) so

that definitive conclusions cannot be drawn. It may also be noted that empirical research

demonstrated a relation between expectations and customer satisfaction. This relation is

reflected in disconfirmation theory, in which expectations are conceived of as antecedents of

customer satisfaction (e.g., Oliver, 1997; Tse & Wilton, 1988). Because trust concerns a

person’s expectations regarding an exchange partner (Morgan & Hunt, 1994), trust may also

be conceived of as an antecedent of customer satisfaction (Singh & Sirdeshmukh, 2000).

In the financial services industry, trust is often conceived of as confidence in the

reliability and integrity of a company. This is in agreement with the expectations-conception

of trust, which is the common conception of trust in the marketing literature. Because persons

are expected to prefer a company they trust to companies they do not trust, trust is considered

an important success factor for companies in the financial services industry (e.g., Goedee,

Reijnders, & Van Thiel, 2008).

Conceptions of quality

There are two important conceptions of quality, which are objective quality and perceived

quality (Oliver, 1997, pp. 162-166). Objective quality pertains to the extent that a product, a

service, or a process meets its technical specifications. It may be operationalised as the

number of failures of a product, a service, or a process (e.g., Garvin, 1983; Kackar, 1989, p. 6;

Woodall, 2001; because the number of failures is counter-indicative of quality, small numbers

of failures reflect high quality and large numbers of failures reflect low quality). Perceived

quality pertains to a person’s judgements of quality of products or services. It may be

operationalised on the basis of a questionnaire (e.g., Parasuraman, Berry, & Zeithaml, 1988;

Cronin & Taylor, 1992). Perceived quality is similar to perceived performance of products or

services, which is broadly conceived of as an antecedent of customer satisfaction (e.g., Oliver,

1997; Tse & Wilton, 1988; Yi, 1990).

The meaning of quality is context-specific. This implies that the definition and the

operationalisation of quality have to be adapted to the context and the purpose of a study. In

the present study, we defined quality as a person’s perceptions of the quality of attributes of

products and services provided by the company (also, see Chapter 5). Thus, in this study

quality was conceived of as perceived quality, which is in agreement with the conception of

quality in many studies (e.g., Grönroos, 1990; Zeithaml, Parasuraman, & Berry, 1990).

Furthermore, quality is established with respect to distinct attributes of products and services,

which corresponds with the suggestion of theorists (e.g., Anderson & Mittal, 2000; Zeithaml

et al., 1990; Zeithaml & Bitner, 1996) to distinguish different dimensions of quality. For

example, Zeithaml and Bitner (1996, p. 85) distinguished service quality, product quality, and

price quality as drivers of customer satisfaction. The combination of a customer’s positions on

these dimensions was expected to drive customer satisfaction.

Service quality has been studied extensively (e.g., Cronin & Taylor, 1992, 1994;

Grönroos, 1984, 1990; Parasuraman, Zeithaml, & Berry, 1985, 1988, 1994; Zeithaml &

Bitner, 1996; Zeithaml, Parasuraman, & Berry, 1990). These studies yielded several

measurement instruments for service quality, for example SERVQUAL (Parasuraman et al.,

1988) and SERVPERF (Cronin & Taylor, 1992). One remark is in order concerning these

instruments. SERVQUAL and SERVPERF were developed for the measurement of quality

across industries, but they were not customised for the measurement of quality in particular

industries, such as retail banking (e.g., Buttle, 1996; Coulthard, 2004; Newman, 2001; Oliver,

1997, p. 49). Therefore, the instruments may not cover all aspects of quality that are relevant

within a particular industry, and for that reason business researchers are required either to

customise these instruments to their research domain or to develop new measurement

instruments.

In the financial services industry, quality is broadly conceived of as a driver of customer

satisfaction (e.g., Goedee et al., 2008; Terpstra & Van Gastel, 2004). This is in accordance

with academic studies and theories (e.g., Caruana, 2002; Oliver, 1997; Van Montfort,

Masurel, & Van Rijn, 2000; Tse & Wilton, 1988; Yi, 1991; Zeithaml & Bitner, 1996). A

major part of in-company research in this industry is aimed at the assessment of distinct

dimensions of quality, and their relations with satisfaction. For this purpose, quality is mostly

operationalised on the basis of quality judgements by customers, regarding distinct attributes

of products and services.

Conceptions of customer loyalty

In present marketing theories, customer loyalty is conceived of as a psychological construct.

Gremler and Brown (1996, 1999) have defined loyalty to a service provider as ‘the degree to

which a customer exhibits repeat purchasing behaviour from a service provider, possesses a

positive attitudinal disposition towards the provider, and considers only this provider when a

need for this service arises’. This definition encloses three different aspects of loyalty, which

are (a) behavioural loyalty, (b) attitudinal loyalty, and (c) cognitive loyalty. Gremler and

Brown (1996) described the ultimately loyal customer as one who ‘regularly uses a service

provider, really likes the organisation and thinks very highly of it, and does not ever consider

using another service provider for this service’. This description of the loyal customer

includes an implicit comparison of the service provider with other providers (also, see Dick &

Basu, 1994). On the other end of this continuum is the ultimately non-loyal customer, who

may be described as one who does not regularly use a service provider, does not really like

the organisation, does not think highly of it, and considers using another service provider for

this service (Gremler & Brown, 1996). Gremler and Brown’s (1996, 1999) conception of

loyalty to a service provider is similar to Oliver’s (1997, 1999) conception of customer

loyalty in general.

Most theorists agreed that customer loyalty encompasses psychological aspects as well

as behavioural aspects (e.g., Dick & Basu, 1994; Gremler & Brown, 1996, 1999; Oliver 1997,

1999). Therefore, the construct has to be measured on the basis of a set of items that reflect

both aspects. Empirical research using measurement instruments of customer loyalty that are

composed of items reflecting psychological aspects and behavioural aspects of customer

loyalty (e.g., Caruana, 2002; Gremler & Brown, 1999), yielded unidimensional measurements

of customer loyalty. Customer loyalty has also been operationalised as an intention to

recommend the company to family, friends, or colleagues (e.g., Reichheld, 2006). For

three reasons, it is doubtful whether this was a proper operationalisation. First, the

operationalisation did not agree with the definitions of customer loyalty provided by Oliver

(1997, 1999) and Gremler and Brown (1996, 1999). Reichheld’s (2006) operationalisation

appears more consistent with conceptions of word-of-mouth, which is a concept that was not

investigated in this study. Second, the operationalisation ignored the general principle that

psychological constructs are best measured on the basis of multiple-item scales (e.g., Messick,

1989). Third, Terpstra (2006a) found indications that customers who said they would

recommend a particular company to friends and family often also said they would recommend

competing companies. This seems to be inconsistent with customer loyalty.

In the financial services industry, customer loyalty is considered important for

commercial success of companies (e.g., Goedee et al., 2008). Customer loyalty is expected to

affect the behaviour of customers and ultimately their profitability. Furthermore, business

researchers in this domain broadly conceive of customer loyalty as a consequence of customer

satisfaction. This agrees with results from academic research (e.g., Caruana, 2002; Gremler &

Brown, 1996; Hennig-Thurau et al., 2002; Oliver, 1997, 1999).

Conceptions of customer profitability

Customer profitability is of major importance for all commercial companies in service

industries, including the financial services industry. Theorists suggested using customer

profitability for marketing decision-making and accounting (e.g., Cooper & Kaplan, 1991;

Mulhern, 1999; Niraj, Gupta, & Narasimhan, 2001). There are two important conceptions of

customer profitability, which are gross customer profitability and net customer profitability.

Gross customer profitability refers to the gross financial contribution of a customer to the

company in some period of time (e.g., Cooper & Kaplan, 1991, p. 469; Niraj et al., 2001). In

the context of retail banking, the gross financial contribution consists of interest profits and

provision profits (to be discussed in Chapter 5). Net customer profitability refers to the net

financial contribution of a customer to a company in some period of time. The net financial

contribution consists of the customer’s gross customer profitability in that period of time

minus the company’s costs allocated to the corresponding customer in the same period of

time (e.g., Campbell & Frei, 2004; Cooper & Kaplan, 1991, p. 469; Mulhern, 1999; Niraj et

al., 2001; Pfeifer, Haskins, & Conroy, 2005).
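
The two conceptions can be made concrete with a small sketch. It assumes, as stated above for retail banking, that the gross contribution is the sum of interest profits and provision profits, and that the net contribution subtracts the costs allocated to the customer in the same period. The class name, the field names, and the amounts are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerPeriod:
    """Financial figures for one customer in one period (hypothetical fields)."""
    interest_profit: float
    provision_profit: float
    allocated_costs: float

    @property
    def gross_profitability(self) -> float:
        # Gross contribution: interest profits plus provision profits.
        return self.interest_profit + self.provision_profit

    @property
    def net_profitability(self) -> float:
        # Net contribution: gross contribution minus allocated costs.
        return self.gross_profitability - self.allocated_costs


# Invented amounts for one customer over one year.
example = CustomerPeriod(interest_profit=320.0, provision_profit=80.0, allocated_costs=150.0)
print(example.gross_profitability)  # 400.0
print(example.net_profitability)    # 250.0
```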

Customer profitability is the resultant of customer behaviour, such as the acquisition

and use of products and services from the focal company. Because customers differ with

respect to their behaviour, they also differ with respect to customer profitability. Furthermore,

because a customer’s behaviour changes over time, customer profitability also changes over

time. For example, a customer who increases his or her business with the company will

become more profitable to the company than he or she was before.

In the financial services industry, customer profitability is the resultant of financial

behaviour. Because a customer’s financial behaviour is related to his or her financial means, a

customer’s profitability is also related to his or her financial means. Obviously, a customer

with large financial means may achieve higher customer profitability than a customer with

smaller financial means. The absence of data with respect to customers’ means, which in this

kind of research is more the rule than the exception, may complicate research into the

connection between customer satisfaction and customer profitability in the financial services

industry.

The operationalisation of customer profitability is context-dependent. For example, the

period of time may be a day, a month, a quarter of a year, or a year (e.g., Campbell & Frei,

2004). For instance, due to the high purchase frequency, a two-week period may be sufficient

to reliably record customers’ purchase behaviour in a supermarket (a two-week period is

expected to cancel out highs and lows), but due to the much lower purchase frequency, at

least a one-year period may be required to reliably record customers’ purchase behaviour with

a retail bank. Therefore, a two-week period may suffice for the operationalisation of customer

profitability for supermarkets, while a one-year period is required for the operationalisation of

customer profitability in retail banking.

We expected that customer satisfaction positively influenced a customer’s gross

financial contribution, but we held no expectation about the influence of customer satisfaction

on the costs associated with a customer. Therefore, we chose the gross customer profitability

conception of customer profitability for the present study. In agreement with this conception

of customer profitability, we defined customer profitability as the gross financial contribution

of a customer to the company in some period of time.

The influence of customer satisfaction on customer profitability

Customer satisfaction is broadly expected to influence customer profitability and company

profitability (e.g., Anderson et al., 1994; Anderson et al., 2004; Anderson & Mittal, 2000;

Fornell, 1992; Gustafsson et al., 2005; Homburg et al., 2005; Mittal & Kamakura, 2001;

Oliver, 1997; Rust & Zahorik, 1993). This is an important reason for the interest in customer

satisfaction in various industries, including the financial services industry.

If customer satisfaction (denoted by CS) influences customer profitability (denoted by

CP), there must be a relation between customer satisfaction at time t = 0 and customer

profitability at time t > 0 (e.g., Ittner & Larcker, 1998). Then a model for the relation between

customer satisfaction (denoted CS_{t=0}), other independent variables (denoted X_i), and future CP

(denoted CP_{t>0}) is:

\[ CP_{t>0} = \alpha + \beta \, CS_{t=0} + \sum_{i} \gamma_i X_i + \ldots + \varepsilon . \]

The model was based on Ittner and Larcker (1998). The exact specification of the model is

context-dependent (see Chapter 6).
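
As an illustration of how a model of this form may be estimated, the sketch below fits an ordinary least squares regression of future customer profitability on customer satisfaction and one additional independent variable. It is not the specification used in this study (see Chapter 6): the choice of current customer profitability as the single additional variable, the helper function `ols`, and the noise-free invented data are all assumptions made for the example.

```python
def ols(design, y):
    """Least-squares coefficients, solved via the normal equations X'X b = X'y."""
    k = len(design[0])
    # Augmented normal-equation system [X'X | X'y].
    a = [
        [sum(row[i] * row[j] for row in design) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(design, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Invented, noise-free data: satisfaction at t = 0 (5-point scale) and
# current customer profitability at t = 0 for six hypothetical customers.
cs_now = [1, 2, 3, 4, 5, 3]
cp_now = [100.0, 50.0, 200.0, 150.0, 300.0, 80.0]
# Future profitability generated with assumed coefficients 10, 5, and 0.9.
cp_future = [10.0 + 5.0 * cs + 0.9 * cp for cs, cp in zip(cs_now, cp_now)]

# Each design row holds the intercept term, CS at t = 0, and the extra variable.
design = [[1.0, float(cs), cp] for cs, cp in zip(cs_now, cp_now)]
alpha, beta, gamma = ols(design, cp_future)
print(round(alpha, 4), round(beta, 4), round(gamma, 4))
```

Because the invented data contain no error term, the fit recovers the generating coefficients; with real data the estimates would carry sampling error, and the estimate of the satisfaction coefficient would be the quantity of interest.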

Henceforth, customer profitability at time t > 0 is labeled future customer profitability,

because it is the customer profitability at a point in time after the measurement of customer

satisfaction. Current customer profitability is measured at time t = 0 and customer satisfaction

is measured at time t = 0.

It is plausible that the effect size of customer satisfaction on future customer

profitability depends on characteristics of customers and markets, such as involvement of

customers and the availability of alternatives in the market. Fornell (1992) hypothesised that

customer satisfaction affects the commercial success of companies that operate in mature and

competitive markets. Therefore, we expect that in retail banking industries in mature markets,

customer satisfaction has a significant positive effect on future customer profitability.

Various studies (Anderson et al., 1994; Anderson & Mittal, 2000; Gruca & Rego, 2005;

Ittner & Larcker, 1998) demonstrated a relationship between customer satisfaction and

company profitability after one year. Ittner & Larcker (1998) also demonstrated a relationship

between customer satisfaction and customer profitability after one year. In-company research

(Terpstra, 2005, 2008) demonstrated a relationship between customer satisfaction and

customer profitability after 15 months. Therefore, we expect that in retail banking the

influence of customer satisfaction on customer profitability is manifest after one year.

Earlier studies in the financial services industry (e.g., Campbell & Frei, 2004; Terpstra,

2005, 2006b) demonstrated that current customer profitability is the major determinant of

future customer profitability. This relationship may be due to, for example, inertia of

customers, and the relationship between current customer profitability and the financial means

of customers. For these reasons, we consider current customer profitability an indispensable

variable in the model of the relation between customer satisfaction and future customer

profitability in retail banking.

6 Measures of satisfaction

Many measures of satisfaction have been reported in the marketing literature (e.g.,

Hausknecht, 1990; Peterson & Wilson, 1992; Westbrook & Oliver, 1981; Wirtz & Lee, 2003).

Hausknecht (1990) listed 34 measures (i.e., operationalisations) of satisfaction, which were

used in satisfaction research. The list included behavioural measures (i.e., registrations of

behaviour, such as number of complaints about the focal product) and self-report measures

(i.e., survey items, such as rating scales). The self-report measures differed with respect to the

number of items included (varying from one to six items), the format of items (verbal items,

graphical items, and items reflecting observations of behaviours such as the number of

complaints), the wording of the items (some items were phrased in the form of a question and

others were phrased in the form of a statement) and the format of response categories (varying

from two to thirteen response categories). Hausknecht (1990) noted that the validity of

measurements of satisfaction was rarely assessed; also see Giese and Cote (2000) and

Peterson and Wilson (1992).

It is remarkable that different measures of satisfaction, which were used in different

studies, yielded similar distributions of satisfaction ratings. Peterson and Wilson (1992) noted

that ‘Virtually all self-reports of customer satisfaction possess a distribution in which a

majority of the responses indicate that customers are satisfied and the distribution itself is

negatively skewed.’ They also demonstrated that method-related factors, such as question

format, question context, questionnaire administration, and measurement timing, affected the

average satisfaction ratings and the skewness of distributions of satisfaction ratings. They

concluded that it is not clear what customer satisfaction ratings reflect, that average

satisfaction ratings are not very informative without valid norms for average customer

satisfaction, and that more effort is needed to improve the measurement of customer

satisfaction.
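The distributional pattern described by Peterson and Wilson (1992) can be made concrete with a small computation. The sketch below uses invented ratings on a hypothetical 5-point item. The data, the cut-off for 'satisfied' (a rating of 4 or 5), and the function name are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def skewness(values):
    """Return the moment-based skewness: mean cubed deviation divided by sd cubed."""
    centre = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        raise ValueError("skewness is undefined for constant data")
    return sum((v - centre) ** 3 for v in values) / (len(values) * spread ** 3)


# Hypothetical ratings on a 5-point item (1 = very dissatisfied, 5 = very satisfied).
ratings = [5] * 40 + [4] * 35 + [3] * 15 + [2] * 7 + [1] * 3

# Assumption: a rating of 4 or 5 counts as 'satisfied'.
share_satisfied = sum(r >= 4 for r in ratings) / len(ratings)
rating_skewness = skewness(ratings)

print(f"share satisfied: {share_satisfied:.2f}")
print(f"skewness: {rating_skewness:.2f}")
```

In these invented data the majority of the ratings falls in the two highest categories and the skewness is negative, which is the pattern that the quotation describes.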

In this section, different measures of satisfaction are discussed in association with the

corresponding definitions of satisfaction as discussed in Section 2 (Table 2). The definitions

of satisfaction and the corresponding measures are listed in Table 3.

Tse and Wilton (1988) used a single-item measure of satisfaction, which was a 5-point

bipolar item with response categories ranging from very dissatisfied to very satisfied. The

item reads: ‘Considering everything, how satisfied are you with the [product]?’. This bipolar

item is a rather common measure of satisfaction, which was also used by others who,

however, used a 7-point rating scale instead of a 5-point rating scale (e.g., Westbrook &

Oliver, 1991; Wirtz & Lee, 2003). Furthermore, the item was used in various multiple-item

measures of satisfaction (e.g., Wirtz & Lee, 2003).

Tse and Wilton (1988) demonstrated that their single-item measure correlated with

disconfirmation and perceived performance. Nevertheless, the measure has three drawbacks.

First, the definition of satisfaction by Tse and Wilton (1988) has a level of abstractness that

does not automatically lead to this specific item. Second, it is a single-item measure of

satisfaction, whereas most theorists suggested the use of multiple-item measures for the

measurement of psychological constructs such as satisfaction because multiple-item measures

better capture the meaning of the construct (e.g., Churchill, 1979; Jacoby, 1976; Messick,

1989; Yi, 1990). Third, Westbrook and Oliver (1991) demonstrated that their 7-point version

of the item performed worse than other measures of satisfaction that were used in the same

study. These three drawbacks call into question the validity of the measurement using a single item.

Bloemer (1993) proposed a two-step approach to measure satisfaction and

dissatisfaction. First, a person was asked whether he or she was satisfied or dissatisfied with

the focal object. Second, the person was asked how satisfied (or how dissatisfied) he or she

was in terms of, for example, a percentage ranging from 0 to 100. Bloemer’s (1993) measure

correlated with commitment and repeat-purchasing behaviour. However, three comments are

in order. First, the measure lacks a thorough explanation. Bloemer (1993, pp. 79, 128)

conceived of satisfaction and dissatisfaction as two different dimensions, but this does not

explain the use of a two-step approach to measure satisfaction and dissatisfaction. One may

argue that if satisfaction and dissatisfaction are conceived of as different dimensions, it is

appropriate to separately measure the level of satisfaction as well as the level of

dissatisfaction of each customer. Second, the assessment of the level of satisfaction is based

upon only one item (Bloemer, 1993, p.145), but most theorists advocate multiple-item scales

for the measurement of psychological constructs such as satisfaction (e.g., Churchill, 1979;

Jacoby, 1976; Messick, 1989; Yi, 1990). Third, a study by Westbrook and Oliver (1991), who

used one 11-point item on satisfaction and one 11-point item on dissatisfaction, indicated that

dissatisfaction and satisfaction are opposites on a bipolar dimension. This is in contrast with

Bloemer’s (1993) stance.

Howard and Sheth (1969) did not discuss the measurement of satisfaction, and did not

propose a measure of satisfaction. Measures of satisfaction that are associated with the

definition of satisfaction as a cognition (e.g., Howard & Sheth, 1969) are summated

performance ratings (Oliver, 1997, p. 318). An example is the measurement of customer

satisfaction by means of the sum of a customer’s ratings of features of products and services.

We subscribe to Oliver’s (1997, pp. 33-34, 318) criticism that (a) it is unclear which features

of products and services may be used for the measurement of customer satisfaction and how

these features may be weighted, (b) these measurements do not match the theoretical meaning

of satisfaction, which incorporates the affective content of satisfaction, and (c) these

measurements are useless for research in which the influence of features of products and

services on satisfaction is investigated.

Fornell (1992) proposed a measure of summary satisfaction (or cumulative satisfaction)

that was composed of three 10-point bipolar items. The items concerned (a) global

satisfaction of the customer with the product, service, or company, (b) disconfirmation of

expectations of the customer regarding the product, service, or company, and (c) the distance

from the customers’ hypothetical ideal product, service, or company. The measure was

incorporated in the Swedish Customer Satisfaction Index, the Norwegian Customer

Satisfaction Index, and the American Customer Satisfaction Index (e.g., Fornell, 1992;

Fornell, Johnson, Anderson, Cha, & Bryant, 1996; Johnson, Gustafsson, Andreassen, Lervik,

& Cha, 2001), and it was used in various empirical studies (e.g., Anderson, Fornell, &

Lehmann, 1994; Anderson, Fornell, & Mazvancheryl, 2004; Anderson & Mittal, 2000; Gruca

& Rego, 2005; Ittner & Larcker, 1998). This strengthened the confidence in the quality of the

measure. However, the measure lacks correspondence with Fornell’s (1992) definition of

satisfaction, meaning that it is not obvious why the abstract definition of satisfaction resulted

in this particular measure of satisfaction and not in another measure. For example, Verhoef

(2001, p. 18, p. 57) developed a measure of satisfaction on the basis of the definition by

Fornell (1992; also Anderson, Fornell, & Lehmann, 1994) that was much different from

Fornell’s (1992) measure. Verhoef’s (2001) measure of satisfaction was the total score (or the

factor score derived from confirmatory factor analysis) on seven items regarding satisfaction

with the company. An example of such an item was ‘How satisfied are you with the personal

attention of XYZ?’, which had five response categories, ranging from very dissatisfied to very

satisfied (the seven items were in the same format). Because Fornell’s (1992) definition does

not provide many clues for constructing measures, it is difficult to judge whether Fornell’s

(1992) and Verhoef’s (2001) measures correspond with this definition.

Oliver (1997, p. 343) proposed a measure of summary satisfaction that incorporates

different phenomena together defining the meaning of satisfaction. First, Oliver noted that

satisfaction is best measured using a multiple-item scale. Second, he noted that the measure

should contain an anchor item, which is an item formulated in terms of general satisfaction

with the product or the service provided. Third, Oliver listed several aspects or antecedents of

satisfaction that may be incorporated in a measure of satisfaction, such as performance

evaluations, expectations, disconfirmation, need fulfilment, dissonance, and affects. Fourth,

he included several items that are counter-indicative of satisfaction and, consequently,

indicative of dissatisfaction. This is consistent with the conception of dissatisfaction as the

opposite of satisfaction on a bipolar dimension, and with general psychometric principles

regarding the measurement of psychological constructs (e.g., Oort, 1996).

The inclusion of items on various phenomena in Oliver’s (1997) measure of summary

satisfaction does not imply that the author conceived of summary satisfaction as a

multidimensional construct. The dimensionality of a construct is ultimately an empirical

question, and empirical research (e.g., Mano & Oliver, 1993; Oliver & Swan, 1989; Oliver,

1993; Wirtz & Lee, 2003) has supported the conception of summary satisfaction as a

unidimensional construct.

Oliver’s (1997) measure of summary satisfaction was composed of twelve 5-point

Likert items. Seven items were indicative of satisfaction and five items were counter-

indicative of satisfaction. The measure was accommodated to the measurement of satisfaction

with one’s car. An earlier version of the measure was composed of six 5-point Likert items

(Oliver, 1980), and was accommodated to the measurement of satisfaction with a flu

vaccination program.
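A total score on a scale that mixes indicative and counter-indicative items is commonly obtained by reverse-scoring the counter-indicative items before summation. The sketch below illustrates this for a twelve-item, 5-point scale. It is not Oliver's (1997) scoring procedure: the item positions that are treated as counter-indicative and the function name are hypothetical.

```python
def summary_score(responses, reversed_items, n_categories=5):
    """Sum Likert responses after reverse-scoring the counter-indicative items."""
    total = 0
    for index, response in enumerate(responses):
        if not 1 <= response <= n_categories:
            raise ValueError(f"response {response} is outside 1..{n_categories}")
        if index in reversed_items:
            # Reverse-score: 1 becomes 5, 2 becomes 4, and so on.
            response = n_categories + 1 - response
        total += response
    return total


# Assumption: the last five of the twelve items are the counter-indicative ones.
REVERSED = {7, 8, 9, 10, 11}

# A respondent who fully agrees with the seven indicative items
# and fully disagrees with the five counter-indicative items.
very_satisfied = [5] * 7 + [1] * 5
print(summary_score(very_satisfied, REVERSED))
```

After reverse-scoring, a higher total score indicates more satisfaction on every item, which is consistent with the conception of dissatisfaction as the opposite of satisfaction on a bipolar dimension.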

Oliver (1997) argued that the optimal composition of a measure depends on (a) the

research topic and (b) the research purpose. For example, if a particular phenomenon such as

disconfirmation has to be related to satisfaction, it should not be incorporated in the

satisfaction measure (Oliver, 1997, p. 343). This is in accordance with the psychometric

principle regarding divergent validity (Campbell & Fiske, 1959).

Giese and Cote (2000) argued that a measure of satisfaction should be context-specific

and, as a result, they did not propose a measure of satisfaction that is generally applicable.

The absence of a general measure is consistent with the view that satisfaction may have

different meanings in different contexts, and contrasts with Fornell’s (1992) position that resulted

in a measure that was applicable across a variety of industries.

Three remarks are in order with respect to the measurement instruments of satisfaction

listed in Table 3. First, the correspondence between a particular definition of satisfaction on

the one hand and a particular measurement instrument of satisfaction on the other hand is

often ambiguous. Thus, it is not obvious why a particular definition of the construct resulted

in a particular measurement instrument for satisfaction, and not in another one. This lack of

clarity may be due to the generality of most definitions of satisfaction, which did not provide

sufficient clues for the development of a measurement instrument of satisfaction. For

example, the definition of satisfaction by Fornell (1992; see also Anderson, Fornell, &

Lehmann, 1994) was used as a justification for two very different measurement instruments of

customer satisfaction.

Second, construct validity has received too little attention. Satisfaction studies yielded evidence

of convergent, divergent, and nomological validity of measurements of satisfaction (e.g.,

Oliver, 1980; Oliver & Burke, 1999; Tse & Wilton, 1988; Verhoef, 2001; Westbrook &

Oliver, 1991; Wirtz & Lee, 2003), but failed to address the main threats to construct validity,

which are construct underrepresentation and construct-irrelevant variance (Messick, 1989).

For example, except for Oliver’s (1997) measure, it was rarely investigated whether the

measures adequately represented the construct, and none of the other studies investigated

contamination of measurements with method-related irrelevant variance.

Third, the usefulness of satisfaction research for the further development of satisfaction

theory may be enhanced by the further improvement of measurement instruments of

satisfaction. Because the meaning of satisfaction is context-specific, such measurement

instruments may be developed on the basis of context-specific definitions of satisfaction. This

implies the development of different measurement instruments for satisfaction for different

research domains (also, see Giese & Cote, 2000).

7 Discussion

Satisfaction may be considered a response to disconfirmation; thus, the process that evokes

the satisfaction response is at the centre of attention. The definitions associated with this

conception are process definitions. They describe the process that evokes the satisfaction

response, but fail to explain what the satisfaction response is (Oliver, 1997, pp. 12-13).

Alternatively, satisfaction may be considered a valenced response to consumption. Here, the

content of the satisfaction response is central. Because the meaning of satisfaction concerns

the content of the satisfaction response (for a more general discussion, see Sartori, 1984;

Schouwstra, 2000), we consider the latter conception more useful for defining satisfaction

than the former conception.

The prototypical definitions associated with the conception of satisfaction as a valenced

response to consumption differ with respect to the specification of the properties of the

satisfaction response and the level of detail of the explanation of the satisfaction response.

First, Howard and Sheth (1969) defined satisfaction as a cognitive response to consumption,

whereas Giese and Cote (2000) defined satisfaction as an affective response to consumption.

Second, Fornell (1992) provided a generic definition of satisfaction, whereas Oliver (1997)

provided a detailed definition of satisfaction. As was noted in Section 6, Fornell’s (1992)

definition of satisfaction was too generic for the development of a measurement instrument of

satisfaction. Following Giese and Cote (2000), we think that a sufficiently detailed definition

of satisfaction requires the specification and the explanation of (a) the type of satisfaction

response, (b) the focal object of the satisfaction response, and (c) the timing of the satisfaction

response.

There is no consensus definition of satisfaction, which probably is due to the context-

specific nature of satisfaction (Giese & Cote, 2000). Therefore, we subscribe to Giese and

Cote’s (2000) recommendation to develop context-specific definitions and corresponding

measurement instruments of satisfaction. Because the meaning of satisfaction is context-

dependent, we do not agree with Giese and Cote (2000) that satisfaction is limited to affective

responses to consumption experiences. Oliver (1997) demonstrated that satisfaction can have

cognitive content and affective content, because it can manifest in performance evaluations,

expectations, disconfirmation, regret, and emotions. Whether the cognitive content or the

affective content prevails depends on the research domain and on characteristics of the

person (Oliver, 1997, pp. 316-318).

Four additional remarks are in order to explain satisfaction in the context of retail

banking. First, satisfaction pertains to the satisfaction of the customers of the bank. For this

reason, we consider customer satisfaction the best term for satisfaction in the context of retail

banking. Second, consumption of products and services from a retail bank is an ongoing

process. Persons remain customers of a bank for a long period of time, in which they make use

of products and services from the company, and maintain some contact with the company. In

this context, customer satisfaction results from the accumulation of encounters with the

company. Third, because customer satisfaction may result from unappraised affects, appraised

affects, unappraised cognitions, and appraised cognitions, the construct includes both manifest

customer satisfaction and latent customer satisfaction. Fourth, because a customer’s

satisfaction with a bank may range from very satisfied to very dissatisfied, customer

satisfaction is the opposite of customer dissatisfaction on a bipolar dimension. In this study,

each of these four remarks was taken into account.

Explicit definition of customer satisfaction with a retail bank

Giese and Cote (2000) rightly argued that the meaning of satisfaction is context-specific, and

that the definition and measurement of satisfaction also need to be context-specific. It is not

possible to develop an explicit definition of satisfaction that grasps the meaning of satisfaction

in all contexts. It is more fruitful to analyse the meaning of customer satisfaction within a

particular context, and then develop a context-specific definition. This study pertains to

customer satisfaction with a retail bank, and it is limited to summary satisfaction. In this

context, customer satisfaction

(a) is limited to the satisfaction of customers of the company;

(b) pertains to the company as a whole, and not to single products or services;

(c) results from the accumulation of encounters of customers with the company;

(d) results from the psychological processing of consumption outcomes;

(e) covers customers’ affects and cognitions reflecting a value judgement;

(f) may result from appraised affects, appraised cognitions, unappraised affects, and

unappraised cognitions;

(g) becomes manifest in customers’ performance evaluations, expectations,

disconfirmation, emotions, and regret; and

(h) is the opposite of customer dissatisfaction on a bipolar dimension.

These eight characteristics explain the content of customer satisfaction with a retail

bank. We summarise them as follows: customer satisfaction with a retail bank is the

valenced response of the customer, directed towards the retail bank, and evoked by the

customer’s experiences with the retail bank throughout time. This is the explicit definition of

customer satisfaction with the retail bank. It may be noted that the definition covers the three

components that Giese and Cote (2000) required of a definition of satisfaction. First,

satisfaction is conceived of as the customer’s valenced response. Second, the focus of the

customer’s response is the retail bank. Third, the timing of the response is during or after the

customer’s experiences with the retail bank. Because evaluations range from positive to

negative, dissatisfaction is simultaneously defined as the opposite of satisfaction on a bipolar

dimension.

Implicit definition of customer satisfaction with a retail bank

Whereas the explicit definition addresses the construct, the implicit definition addresses the

construct’s relations to other constructs and variables (Schouwstra, 2000, p. 61). Therefore,

the implicit definition of customer satisfaction is founded on the nomological network of the

construct, which was discussed in Section 5 of this chapter.

Customer satisfaction with a retail bank is implicitly defined in terms of its relations

with trust, quality, and customer loyalty, and its influence on customer profitability. As a

consequence, it is expected that overall satisfaction with a retail bank is positively related to

(a) trust in the company, (b) quality perceptions regarding the products and services provided

by the company, (c) loyalty to the company, and (d) future customer profitability.

8 Conclusions

1. The meaning of customer satisfaction differs between and within contexts. For this

reason (a) it cannot be sharply defined but it needs to be explained by means of

examples, and (b) the examples are context-dependent.

2. Dissatisfaction may be conceived of as the opposite of satisfaction on a bipolar

dimension. This means that satisfaction/dissatisfaction is expected to constitute a

unidimensional construct, and that customers are not both satisfied and dissatisfied with

the same phenomenon at one point in time.

3. Customer satisfaction with a retail bank is explicitly defined as the valenced response of

the customer that is directed towards the bank and that is evoked by the whole of

consumption experiences with the bank. This definition encompasses various cases that are

mutually related by family resemblances.

4. Customer satisfaction with a retail bank is implicitly defined on the basis of its

connections with other psychological constructs and with behaviour. In the domain of

retail banking, the relations of satisfaction with (a) trust, (b) quality, (c) customer

loyalty, and (d) future customer profitability are considered most important.

5. Many measures of satisfaction have been reported in the marketing literature, and

different measures of satisfaction are associated with different definitions of the

construct. However, evidence of construct validity of most measures of satisfaction is

absent. This limits the usefulness of satisfaction research for the development of

satisfaction theory.

6. The usefulness of satisfaction research for the development of satisfaction theory may

be enhanced by further improvement of the measures of satisfaction. The improvement

of measures of satisfaction entails (a) explication of the context-specific meaning of

satisfaction, (b) explication of correspondence between the definition and the measure of

satisfaction, and (c) assessment of validity of measurements of satisfaction. In the next

chapters, we will develop a context-specific measurement instrument of satisfaction and

validate the measurements obtained with this instrument.

Chapter 4

Deductive design for test development and construct validation

1 Introduction

Psychological properties can be measured by means of psychological tests (Chapter 1, Section

4). A psychological test is an instrument which elicits behaviour that is representative of the

property of interest and which can be used to measure the extent to which a person possesses

the property. A test may consist of a well-chosen set of items that are administered in a

survey. On the basis of the responses a person provides to these items, his or her position on

the scale for the property is inferred.

This chapter addresses the design of the empirical study. The purpose of the study was

to develop a measurement instrument for customer satisfaction with retail banks, and to test

the relations of customer satisfaction with constructs and variables in the corresponding

nomological network. For this purpose, we applied the deductive design (Schouwstra, 2000)

for test development and construct validation.

2 The deductive design

The deductive design (Schouwstra, 2000) is a methodology for test development and

construct validation for typical-behaviour properties such as customer satisfaction. The

methodology departs from a theoretical analysis of the construct of interest. In this respect, it

is consistent with the deductive approach to test development (Oosterveld, 1996).

Following Messick (1989, pp. 13, 34), Schouwstra (2000, p. 57) defined construct

validity as ‘an evaluative judgement of the trustworthiness of a test-score interpretation in

terms of a construct’. Messick (1989, pp. 34-35, 1995) addressed two general threats to

construct validity, which are (a) construct underrepresentation, and (b) measurement of

irrelevant variance. Construct underrepresentation occurs when only a part of the construct is

measured. For example, a test measures only a part of the construct of customer satisfaction

with a focal object when it only includes items that reflect cognitions about the object but no

items that reflect affects. Measurement of irrelevant variance occurs when not only the

construct is measured, but also other psychological properties, attributes related to group

membership, or response tendencies. For example, a test for customer satisfaction measures

more than just the intended construct when it also includes items that require a high level of

verbal intelligence to be comprehended. Then, the test scores also depend on verbal

intelligence, and the variation in test scores that is caused by variation in verbal intelligence is

conceived of as irrelevant variance. Also, a test for customer satisfaction may be administered

to one part of the sample by telephone and to another part by the Internet, and as a result of

these different administration modes different response categories may be used. Now, test

scores partly depend on administration mode, and the variation in test scores that is caused by

differences in the administration procedure is conceived of as irrelevant variance. Both

construct underrepresentation and irrelevant variance refute the interpretation of test scores in

terms of a reflection of the construct and nothing else (Messick, 1989; Schouwstra, 2000).

Hence, construct validation concerns the assessment of construct representation and absence

of irrelevant variance.

Following Anastasi (1986), Schouwstra (2000) argued that construct validation should

start at the outset of test development. This stance is reflected in the deductive design, which

demands two lines of evidence for construct validation (Table 1; from Schouwstra, 2000, p.

60). The first line of evidence should consist of rationales underlying the test-score

interpretations, and the second line of evidence should consist of empirical evidence that the

test score reflects the complete construct and nothing else. Each line of evidence should

address construct representation and absence of irrelevant variance in test scores.

Table 1: Outline of Construct Validation Within the Deductive Design (Schouwstra, 2000, p. 60)

Scientific arguments   Construct representation                      Irrelevant variance
Rationales
  a. Formulation       Of what construct of interest is              And what not
  b. Translation       Of construct of interest into test content    And nothing else
  c. Modelling         How test score reflects construct
Empirical evidence     That test score reflects whole of construct   And nothing else

The rationales consist of an explanation of how the test-score interpretations are derived

from the theory about the construct (Schouwstra, 2000). First, this explanation requires

formulating what the construct of interest is and what it is not, and to which other constructs it

is related. The construct has to be defined explicitly by means of the specification of the

aspects and attributes to which it refers, and implicitly by the specification of related concepts

that constitute the nomological network. Second, the way in which the construct definition is

translated into test content needs to be specified. This specification involves the formulation

of (a) guidelines concerning the formulation of items that reflect the construct and nothing

else, (b) guidelines for acts that control for possible response tendencies, and (c) the items,

which constitute the operationalisation of the construct. Third, the measurement model that is

expected to fit the empirical data needs to be specified. This specification includes the

explanation of the relationship between the items and the test score.

The empirical evidence consists of results from empirical research into the test-score

interpretations. Following Cronbach (1988, 1989), Schouwstra (2000, pp. 1-3) noted that a

strong version of construct validity research involves the testing of hypotheses about what a

test score measures and what it does not measure. These hypotheses refer to (a) the explicit

construct representation, (b) the implicit construct representation, (c) concept-related

irrelevant variance, and (d) method-related irrelevant variance (Schouwstra, 2000, pp. 68-71).

The explicit construct representation of test scores encompasses content validity,

convergent validity, and divergent validity, and is assessed on the basis of tests of

corresponding hypotheses. The implicit construct representation pertains to the nomological

validity of test scores, and is assessed on the basis of tests of hypotheses regarding the

relationship of test scores with measures of other concepts in the nomological network.

Method-related irrelevant variance pertains to variance caused by phenomena that are not

related to the construct of interest, such as response tendencies and characteristics of the

research method. Concept-related irrelevant variance pertains to variance caused by

phenomena that are related to the construct of interest, such as the concepts in the

nomological network and properties related to group membership. Both method-related

irrelevant variance and concept-related irrelevant variance are investigated on the basis of

tests of hypotheses regarding the contamination of test scores by other properties and

variables. The methodology to test hypotheses regarding the contamination of test scores is

addressed in the next section.

Both lines of evidence need to be integrated into an evaluative judgement of the validity

of the test-score interpretations (Schouwstra, 2000, p. 71). This judgement indicates

whether and to what extent the evidence supports the interpretation of test

scores in terms of the construct of interest, and nothing else. The more comprehensive the

argumentation for the test-score interpretation, the more convincing the support for construct

validity. However, the support is never conclusive. First, construct validation is an unending

process that includes the judgement of evidence gathered in the processes of test development

and test use (Anastasi, 1986, 1988; Cronbach, 1971, p. 452; Messick, 1989, p. 13). Second,

construct representation is to some extent arbitrary, because constructs do not have sharp

boundaries (e.g., Wittgenstein, 1953, 1958). Third, it is not possible to exclude all irrelevant

variance in the context of psychological measurement. For example, most psychological tests

require linguistic skills of participants, and the extent to which a participant possesses these

skills can influence his or her response behaviour (e.g., Schouwstra, 2000, p. 63). When a test

is used in a sample containing participants with different levels of education, test scores are

readily biased with respect to varying linguistic skills of participants. This example implies

that construct validity is almost always imperfect.

Summarising, the deductive design is a methodology for test development and test-score

validation. First, it is directed towards the development of tests that encompass all the

important aspects of the construct of interest. Second, the deductive design is directed towards

the minimisation of test-score variance that is irrelevant to the construct of interest

(Schouwstra, 2000, pp. 81-83). The methodology encompasses the explication of rationales

and the collection of empirical evidence with respect to the interpretation of test scores.

3 The theory of violators

The theory of violators (Oort, 1996) addresses a methodology to test hypotheses with respect

to the contamination of test scores by other variables, such as other traits, attributes related to

group membership, or response styles. In the theory of violators, irrelevant variance is

conceived of as variance caused by phenomena that violate the unidimensionality of the scale

(Oort, 1996). The theory of violators is based upon the following definitions of item bias and

unidimensionality (Oort, 1996, p. 7): ‘A scale consisting of a set of items is unidimensional if

and only if each of the items is unbiased with respect to every potential violator that might be

relevant in whatever context the test might be used’, and ‘An item I is unbiased with respect to

a potential violator V and given trait T if and only if, for all values i and v and t:

P(I=i | T=t, V=v) = P(I=i | T=t).’

The theory of violators requires local independence between item and violator, meaning

that the probability of endorsing item I, given trait-score t, is independent of violator-score v.

Marginal independence between item and violator is not required: without conditioning on the

trait, the probability of endorsing item I may depend on violator-score v.

Let rest-score R be the total score of a person on the set of items measuring trait T minus

the score on an item I. Then, item bias may be investigated by means of the partial correlation

of an item I and violator V while controlling for the rest-score R. Oort (1996) suggested

restricted factor analysis (to be discussed in Chapter 6) for testing the hypothesis that test

scores are not contaminated by a violator V.
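
The partial correlation mentioned above can be sketched as follows. This is only an illustration of a first-order partial correlation between item I and violator V, controlling for rest-score R. It is not the restricted factor analysis suggested by Oort (1996), and the function names and the toy data in the usage note are assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr_item_violator(item, violator, rest):
    """Correlation of item and violator scores after partialling out the rest-score.

    A value close to zero is consistent with an unbiased item; a value
    clearly different from zero suggests item bias with respect to the violator.
    """
    r_iv = pearson(item, violator)
    r_ir = pearson(item, rest)
    r_vr = pearson(violator, rest)
    return (r_iv - r_ir * r_vr) / sqrt((1 - r_ir ** 2) * (1 - r_vr ** 2))
```

For instance, with hypothetical scores `item = [0, 0, 1, 1, 1, 1, 2, 2]`, `violator = [0, 1, 0, 1, 0, 1, 0, 1]`, and `rest = [0, 0, 1, 1, 2, 2, 3, 3]`, the violator is unrelated to both the item and the rest-score, so the partial correlation is zero.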

The theory of violators provides a useful methodology for empirical research into the

contaminating effects of violators on test scores. Nevertheless, three comments are in order.

First, research into the unidimensionality of a scale cannot exclude all irrelevant variance that

may threaten the interpretation of test scores in terms of the construct of interest. For example,

characteristics of the measurement instrument (e.g., the method of administration and the

question format; Bradburn, 1983) may affect the magnitude of test scores without affecting

the unidimensionality of the scale. Second, multidimensionality does not necessarily imply

that the measurement is invalid. For example, a particular construct may encompass different

attributes (e.g., intelligence may encompass verbal intelligence and spatial intelligence;

Gardner, 1993), and the measurement of the construct may turn out multidimensional instead

of unidimensional. Third, perfect unidimensionality seems impossible because it is unlikely

that the items of a scale would be unbiased for all possible violators (e.g., Oort, 1996, pp. 18-

19). This is in agreement with the notion that it is impossible to eliminate all irrelevant

variance in psychological measurement, and it endorses the notion that a judgement of

construct validity has to be qualitative and gradual by nature (e.g., Cronbach & Meehl, 1955).

4 Purpose of the study and conditions for test development

The purpose of this study was to develop a measurement instrument for customer satisfaction

with retail banks, and to validate theory regarding the meaning of customer satisfaction in the

domain of retail banking. Given the context of this study, the measurement instrument had to

be accommodated to the meaning of customer satisfaction with a retail bank, and it had to be

used in empirical research in the corresponding domain.

The population of interest in this study consisted of the mature customers of a Dutch

retail bank, in this study denoted as BANK. The measurement instrument had to be applied in

survey research to a sample from this population, and therefore it was administered in Dutch.

Furthermore, the instrument had to comply with requirements regarding the composition of

questions and questionnaires used in surveys (e.g., Belson, 1986; Dillman, Tortora, &

Bowker, 1998; Sheatsley, 1983; Sudman & Bradburn, 1982).

5 Test development

Test development is the development of the measurement instrument. In Chapter 3 (Section

7), customer satisfaction with a retail bank was explained on the basis of eight characteristics,

which were summarised in the explicit definition: customer satisfaction with a retail bank is

the valenced response of the customer, directed towards the retail bank, and evoked by the

customer’s experiences with the retail bank throughout time. Furthermore, customer

satisfaction was defined implicitly by its connections with trust, quality, customer loyalty, and

customer profitability. The latter concepts are part of the nomological network of customer

satisfaction in the domain of retail banking, and they delineate the construct to a large extent.

The explicit definition of customer satisfaction with a retail bank covers the three

components Giese and Cote (2000) required of a definition of satisfaction, which are response

type, timing of the response, and focus of the response (also, see Chapter 3). The explicit

definition was used here to formulate a facet design (Table 2) with three facets representing

the three components (the response focus facet was not reflected in Table 2, because it had

one element). The facet response type had two elements (i.e., cognitive response and affective

response), the facet time frame had two elements (i.e., present and past), and the facet

response focus had one element (i.e., the bank). Thus, the facet design had four structuples.

Table 2: The Facet Design for Customer Satisfaction with a Retail Bank

Response type    Time frame: Present    Time frame: Past
Cognitive        Structuple 1           Structuple 2
Affective        Structuple 3           Structuple 4
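
The structuples of a facet design are the combinations of one element from each facet. As a minimal sketch, the four structuples of Table 2 can be enumerated as a Cartesian product. The dictionary keys and element labels below are names chosen for this illustration only.

```python
from itertools import product

# Facets and their elements as described in the text (labels are illustrative).
facets = {
    "response_type": ["cognitive", "affective"],
    "time_frame": ["present", "past"],
    "response_focus": ["the bank"],
}

# Each structuple combines one element from every facet.
structuples = list(product(*facets.values()))

for number, structuple in enumerate(structuples, start=1):
    print(f"Structuple {number}: {structuple}")
```

Because the response focus facet has a single element, it does not increase the number of structuples, which is why it could be left out of Table 2.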

The purpose of the design was to facilitate the formulation of an item set that yields a

complete construct representation. Following Oliver (1997, p. 343), we chose to formulate a

comprehensive set of items of the Likert type (Likert, 1932). This type of items allows for the

construction of items that are (a) expected to be monotonically related to the construct of

interest, and (b) either indicative or counter-indicative of the construct of interest (e.g., Oort,

1996). The following specifications guided the formulation of the items:

1. Each structuple is represented by two items. One item should be indicative and the

other counter-indicative of the construct (e.g., Fabrigar, Krosnick, & MacDougall,

2005; Likert, 1932), in order to represent both poles of the

satisfaction/dissatisfaction continuum. In order to prevent the questionnaire from

becoming too long and asking too much of the participants, the number of items for

each structuple was limited to two.

2. Each item should be monotonically related to customer satisfaction. In the context of

this study, this means that the probability of choosing a particular answer category or

a higher answer category in response to a positively worded item, should be a

monotonically nondecreasing function of customer satisfaction (i.e., a function that

decreases nowhere along the scale: it either increases throughout, remains constant,

or increases across some intervals of the scale and remains constant across other

intervals; henceforth, for simplicity, we call this monotonicity; see Sijtsma &

Molenaar, 2002, pp. 20, 119). Negatively worded items are re-coded prior to data

analysis, after which monotonicity should hold for these items as well.

3. Each item should reflect general satisfaction with the company. For this reason, the

subject in each item should be the bank, and not a particular transaction, product, or

product feature.

4. The wording of the items should be kept simple and unambiguous (e.g., Belson,

1986). This means, for example, that the items should be kept short and easy to

understand, and that negations should be avoided.

5. The item set should contain one anchor item (Oliver, 1997, p. 343), which is an item

that is formulated in terms of satisfaction with the company (i.e., an item such as ‘I

am satisfied with BANK’).

6. None of the items should be phrased in terms of related constructs such as trust,

quality, and customer loyalty (Chapter 3). This means that items should not be

phrased in terms of (a) preference for the company over other companies, (b)

expectations regarding competence and integrity of the company, and (c) attributes

of products and services provided by the company.

On the basis of these specifications, a set of nine items was formulated (Table 3). The

set contained one anchor item and eight items representing the four structuples (Table 2). All

items were of the Likert type with five ordered response categories that ranged from totally

agree to totally disagree. We chose five response categories in order to also include a neutral

option. Because satisfaction/dissatisfaction was conceived of as a continuum (Chapter 3), it

was expected that a unidimensional measurement model would fit the empirical data and that

all items were monotonically related to the satisfaction/dissatisfaction dimension.

Table 3: Items of Customer Satisfaction with BANK

Item                                                Structuple   Aspect
I am satisfied with BANK                            None         General satisfaction
BANK meets all my requirements for a bank           1            Need fulfilment
There are good reasons to leave BANK (*)            1            Cognition
BANK has met my expectations                        2            Disconfirmation of expectations
Last year I had some problems with BANK (*)         2            Cognition
At BANK I feel at home                              3            Affect
I have mixed feelings about BANK (*)                3            Affect
Last year I had a pleasant relationship with BANK   4            Affect
I have regretted my choice for BANK (*)             4            Regret

(*) = item is counter-indicative of customer satisfaction with BANK

6 The measurement model

A measurement model is a statistical representation of the responses of the participants of a

survey to the measurement instrument. If the measurement model represents the data well, a

scale for measurement and measurement values for the participants follow from the model. It

was hypothesised that a unidimensional measurement model fits the data, and that all items

are monotonically related to the underlying construct of satisfaction with BANK (see Section

5). The Mokken model of monotone homogeneity (MH model; Mokken, 1971) was used for

this investigation (Chapter 5 through Chapter 8).

The MH model is an item response theory (IRT) model. IRT is a psychometric theory

about the relation between a trait and the probability of a particular response to an item

reflecting the trait. The relationship is typically represented by an item response function

(IRF). For dichotomously scored items (i.e., two scores, often 0/1 for disagree/agree), the IRF

reflects the probability of endorsing an item (i.e., score 1) given a particular position

on the trait (e.g., see the MH model for dichotomous items; Sijtsma & Molenaar, 2002, p. 11).

For polytomously scored items (i.e., three or more ordered scores, reflecting degrees of

endorsement, e.g., 0, 1, 2, 3, 4), the item step response function (ISRF) reflects the probability

of choosing a particular answer category or a higher category (i.e., at least a score of x; e.g., x

= 0, 1, 2, 3, 4) of an item given a particular position on the trait (e.g., the MH model for

polytomous items; Sijtsma & Molenaar, 2002, p. 119).

The MH model is based upon three assumptions (Sijtsma & Molenaar, 2002, pp. 18-21).

The first assumption is unidimensionality, which means that all items reflect the same trait,

for example, customer satisfaction. The second assumption is local independence, which

means that, given a fixed value of the latent trait, the probability of obtaining at least a score

of x is unrelated to the scores obtained on the other items in the test. This means that items

reflecting customer satisfaction are unrelated in a group of persons who have the same level

of customer satisfaction. This may sound odd but local independence is a mathematical way

of saying that only customer satisfaction explains relationships among items measuring

aspects of this trait, and if the trait is held constant all remaining variation in the item scores is

due to error. The third assumption is monotonicity, which means that the probability of

obtaining at least a score of x is a non-decreasing function of the latent trait. Thus, the higher

one’s level of customer satisfaction the higher the probability of obtaining a high score on

items measuring the trait.
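
The assumptions of local independence and monotonicity can also be written compactly. The notation below is an assumption made for illustration: k denotes the number of items, X_1, ..., X_k the item scores, and θ the latent trait, with θ_a and θ_b two arbitrary trait values. Unidimensionality corresponds to θ being a single (scalar) trait.

```latex
% Notation assumed for illustration: k items, item scores X_1, ..., X_k, latent trait \theta.
\begin{align*}
\text{Local independence:}\quad
  & P(X_1 = x_1, \ldots, X_k = x_k \mid \theta)
    = \prod_{i=1}^{k} P(X_i = x_i \mid \theta), \\
\text{Monotonicity:}\quad
  & P(X_i \geq x \mid \theta_a) \leq P(X_i \geq x \mid \theta_b)
    \quad \text{whenever } \theta_a < \theta_b .
\end{align*}
```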

A consequence of the assumptions of unidimensionality, local independence, and

monotonicity is that the MH model yields ordinal measurements of the trait. This is different

from the numerical measurements obtained by more-demanding parametric IRT models, such

as the Rasch model (Rasch, 1960), but ordinal measurements suffice for many measurement

purposes. In particular, let the score of person p on item i be denoted $X_{pi}$, and let the sum score

or the total score of participant p on the items (indexed i) in the test be defined as the sum of

the item scores, $X_{p+} = \sum_i X_{pi}$, then under the MH model the ordering of the participants by

means of their $X_{p+}$ values reflects their ordering on the scale of the latent trait, except for

measurement error (Sijtsma & Molenaar, 2002, p. 121; also, see Van der Ark, 2005).

Empirical research has demonstrated that the total-score $X_+$ has a strong linear

correlation with the estimated latent trait value from parametric IRT models in several

measurement applications (e.g., Sijtsma, Emons, Bouwmeester, Nyklicek, & Roorda, 2008).

Sijtsma et al. (2008) suggested the use of the total-score for diagnostic purposes, such as the

assessment of the position of a person on the latent trait, and for statistical analyses, such as

the comparison of groups or the measurement of change. A general condition for this use of

$X_+$ is that the MH model fits the data.

The extent to which the theoretical data structure predicted by the MH model is different

from the observed data is expressed by means of total-scale scalability coefficient H

(Loevinger, 1948; Mokken, 1971) for the whole set of items, and item scalability coefficient

$H_i$ for individual items. Coefficient H ranges from a negative value, depending on several

characteristics of the item scores, to the maximum of 1. For a given distribution of total-score

$X_+$ and a particular set of monotone increasing ISRFs, as the slopes of the ISRFs become

steeper, item scalability coefficients $H_i$ and total-scale scalability coefficient H have higher

positive values, gradually approaching 1 as the slopes become nearly vertical. Thus, high

positive values (usually, $H_i \geq 0.3$ and $H \geq 0.3$; Sijtsma & Molenaar, 2002, p. 60) of item

scalability coefficients $H_i$ and total-scale scalability coefficient H in a data set are taken as

evidence of steeply monotone ISRFs, and this in turn means that person ordering by means of

$X_+$ is more reliable.
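
To make the scalability coefficients more concrete, the sketch below computes them in the covariance-ratio form: the sum of the observed inter-item covariances divided by the sum of the maximum covariances that are attainable given the marginal score distributions of the items. The maximum covariance of two items is obtained here by sorting both score vectors in the same order. This is an illustrative implementation under these assumptions, not the MSPwin5.0 procedure, and the function names are hypothetical.

```python
def covariance(x, y):
    """Covariance between two equally long lists of item scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def max_covariance(x, y):
    """Largest covariance attainable with the observed marginal distributions.

    It is reached when both score vectors are ordered in the same way.
    """
    return covariance(sorted(x), sorted(y))


def scalability(items):
    """Return (item coefficients H_i, total-scale coefficient H).

    items: list of score lists, one list per item, all of equal length.
    Items without any score variance make the denominators zero and
    should be removed beforehand.
    """
    k = len(items)
    cov = [[covariance(items[i], items[j]) for j in range(k)] for i in range(k)]
    cmax = [[max_covariance(items[i], items[j]) for j in range(k)] for i in range(k)]

    h_items = []
    for i in range(k):
        observed = sum(cov[i][j] for j in range(k) if j != i)
        maximum = sum(cmax[i][j] for j in range(k) if j != i)
        h_items.append(observed / maximum)

    observed = sum(cov[i][j] for i in range(k) for j in range(i + 1, k))
    maximum = sum(cmax[i][j] for i in range(k) for j in range(i + 1, k))
    return h_items, observed / maximum
```

In a perfectly ordered (Guttman) data pattern, every observed covariance equals its maximum and all coefficients equal 1. Two uncorrelated items yield a coefficient of 0.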

A virtue of an analysis by means of the MH model is the availability of the MSPwin5.0

software (software for Mokken Scale analysis for Polytomous items; Molenaar & Sijtsma,

2000). MSPwin5.0 facilitates statistical investigation of whether the MH model fits the data.

In particular, it facilitates (a) the investigation of the dimensionality of an item set using a

confirmatory strategy, or (b) the investigation of the dimensionality of an item set using an

exploratory strategy, and (c) the test of the assumption of monotonicity. Furthermore,

MSPwin5.0 provides the test-score distribution and interesting summary statistics, such as the

mean, the standard deviation, and the skewness of this distribution (Molenaar & Sijtsma,

2000, pp. 60-61).

The confirmatory strategy to investigate the dimensionality entails investigating

whether a set of items, which is defined a priori to form a scale, indeed is found to be a scale

based on values of the item scalability coefficients $H_i$ and total-scale scalability coefficient H

in the sample data set from the population of interest. To have a Mokken scale, all inter-item

correlations must be positive and the values of $H_i$ and H must be at least 0.3 (Sijtsma &

Molenaar, 2002, Chapter 5). A Mokken scale is unidimensional and allows sufficiently reliable

person measurement by means of total-score $X_+$. MSPwin5.0 facilitates this strategy by

means of the item selection method Test (Molenaar & Sijtsma, 2000, p. 48).

The exploratory strategy to investigate the dimensionality entails the clustering of items

from a larger set into smaller clusters (one cluster is also allowed), each of which is

characterised by positive inter-item correlations and item scalability coefficients $H_i$ and total-scale

scalability coefficient H that are at least 0.3. Thus, each cluster represents a Mokken

scale. MSPwin5.0 facilitates this search strategy by means of the item selection methods

Search normal (forms item clusters from a set of items) and Search extended (takes the

second, third, and so on, Mokken scale found by means of Search normal as point of

departure for clustering while leaving the other items free for selection), and the option to

choose different lower bounds than the default value 0.3 for item scalability coefficients $H_i$

and total-scale scalability coefficient H (Molenaar & Sijtsma, 2000, p. 40).

The assumption of monotonicity can be investigated for every ISRF of every item, by

estimating the ISRFs from the data. An item that has five different item scores has five

different ISRFs, which are conditional probabilities $P(X_i \geq x \mid \theta)$, in which x = 0, …, 4 and θ

stands for the latent trait. Because every participant has one of the five possible scores, the

probability of obtaining at least a score of 0 equals 1 (a participant always has one of the

scores). Thus, only the four ISRFs for x = 1, …, 4 are of interest. In data analysis, when the

ISRFs of item i are estimated, the latent trait is replaced by the rest-score R. Rest-score R is

the total-score $X_+$ minus the item-score $X_i$. The use of the total-score $X_+$ would lead to

heavily biased estimates of the ISRFs of item i, and this is prevented by using rest-score R.

A rest-score group contains all participants having equal rest scores. The assumption of

monotonicity is violated in the sample if the probability of obtaining a score on item i of at

least x is higher for a lower rest-score group than for a higher rest-score group. MSPwin5.0

provides an option called Minsize for the manipulation of the minimum size of the rest-score

groups (adjacent rest-score groups may be merged to obtain sufficiently large groups; this is

convenient for small and large scores which are often underrepresented in samples), and an

option Minvi which defines the minimum value of observed violations of monotonicity in

sample ISRFs that are subjected to statistical testing (small violations may be uninteresting

irrespective of whether they are significant or not; Molenaar & Sijtsma, 2000, pp. 67-73). In

MSPwin5.0, the default value for Minsize is 10 percent of the sample size, and the default for

Minvi is 0.03 on a probability scale that runs from 0 to 1. The option Alpha = p manipulates

the significance level for tests of significance of violations of monotonicity. Default in

MSPwin5.0 is Alpha = 0.05.
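
The logic of the monotonicity check can be sketched as follows. The sketch forms rest-score groups of a minimum size by merging adjacent rest scores, estimates the ISRFs of one item as proportions within these groups, and lists decreases larger than a minimum value. It is a simplified illustration: the merging rule is an assumption, no significance test is performed, and it should not be read as the MSPwin5.0 algorithm. The parameter names `minsize` and `minvi` merely mirror the options described above, with `minsize` given here as a number of respondents.

```python
def rest_scores(data, item):
    """Rest-score per respondent: total score minus the score on `item`.

    data: list of respondents, each a list of (re-coded) item scores.
    """
    return [sum(row) - row[item] for row in data]


def rest_score_groups(data, item, minsize):
    """Group the scores on `item` by rest-score, merging adjacent rest
    scores (from low to high) until a group has at least `minsize` members."""
    by_rest = {}
    for row, rest in zip(data, rest_scores(data, item)):
        by_rest.setdefault(rest, []).append(row[item])

    groups, current = [], []
    for rest in sorted(by_rest):
        current.extend(by_rest[rest])
        if len(current) >= minsize:
            groups.append(current)
            current = []
    if current:  # a too-small remainder joins the highest group
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups


def estimate_isrfs(groups, max_score):
    """isrfs[x - 1][g]: proportion in group g with an item score of at least x."""
    return [[sum(score >= x for score in group) / len(group) for group in groups]
            for x in range(1, max_score + 1)]


def violations(isrfs, minvi=0.03):
    """List (x, lower group, higher group, decrease) for every decrease
    of an estimated ISRF that exceeds `minvi`."""
    found = []
    for x, curve in enumerate(isrfs, start=1):
        for low in range(len(curve)):
            for high in range(low + 1, len(curve)):
                decrease = curve[low] - curve[high]
                if decrease > minvi:
                    found.append((x, low, high, decrease))
    return found
```

For an item with scores 0 to 4, `estimate_isrfs(groups, max_score=4)` returns the four ISRFs of interest (x = 1, …, 4). An empty list from `violations` means that no decrease larger than `minvi` was observed in the sample.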

It was hypothesised that the nine items of satisfaction with BANK constitute a scale

according to the MH model. This hypothesis was tested in sample data from the population of

interest. If the MH model fits the data, a scale according to the MH model can be constructed,

and the scale scores can be computed.

7 Hypotheses

This section addresses the formulation of hypotheses regarding characteristics of the

satisfaction scores (i.e., the satisfaction with BANK scale-scores). The hypotheses concerned

(a) the explicit construct representation, (b) the implicit construct representation, (c) concept-

related irrelevant variance, and (d) method-related irrelevant variance, and they were tested in

empirical studies with respect to customer satisfaction (Chapter 5 through Chapter 8). The

purpose of the tests of the hypotheses was to gather empirical evidence on whether the scale

scores can be interpreted in terms of satisfaction with BANK, and nothing else.

Explicit construct representation

First, it was expected that persons attached different connotations to the term satisfaction

when asked to explain what satisfaction with the company meant to them. This expectation

was in line with the theory of Oliver (1997) that satisfaction may result from different

processes, and the notion by Wittgenstein (1953, 1958) that the linguistic meaning of a term

cannot be delineated sharply. Second, it was expected that the nine items (Table 3) constituted

a scale according to the MH model (Section 6). Third, it was expected that the satisfaction

with BANK scale-scores were positively related to other satisfaction with BANK scores. This

was in agreement with the requirement of convergent validity (Campbell & Fiske, 1959).

Implicit construct representation

Customer satisfaction was expected to be positively related to (a) trust, (b) quality, (c)

customer loyalty, and (d) future customer profitability. The associations between these

concepts were postulated in the nomological network of customer satisfaction (Chapter 3).

Concept-related irrelevant variance

Concept-related irrelevant variance refers to variance due to variables that are presumably

related to the construct of interest. Variables that are presumably related to customer

satisfaction are the variables in the nomological network of the construct (Chapter 3). In terms

of the theory of violators (Oort, 1996), such variables are possible violators of the

unidimensionality of the scale of the construct of interest. The measurement instrument for

customer satisfaction was constructed with the purpose to minimise contamination of scale

scores by these variables (Section 5). Therefore, it was expected that trust, quality, customer

loyalty, and current customer profitability did not contaminate satisfaction scores obtained by

the satisfaction with BANK measurement instrument.

Method-related irrelevant variance

Method-related irrelevant variance refers to variance caused by variables that are presumably

unrelated to the construct of interest, such as characteristics of the method of the study and

response styles of persons. Characteristics of the method that may affect response behaviour

are, for example, the mode of administration, the format of items, the item order, and the

wording of items (e.g., Bradburn, 1983). There is ample evidence of the effect of these

phenomena on persons’ responses to items (e.g., Belson, 1981, 1986; Bradburn, 1983;

Bronner & Kuijlen, 2007; Krosnick, 1999; Schuman & Presser, 1981; Sheatsley, 1983). The

classical example was provided by Rugg (1941), who demonstrated that 46% of the

participants in a survey supported free speech when asked ‘Do you think the United States

should forbid public speeches against democracy?’, while only 25% of the participants

supported free speech when asked ‘Do you think the United States should allow public

speeches against democracy?’ Thus, the question phrased in terms of to allow yielded

different results than the question phrased in terms of to forbid. Schuman and Presser (1981,

pp. 276-278) replicated this result.

Paulhus (1991, p. 17) described a response style as a person’s consistent tendency to

respond to questionnaire items on some basis other than the specific item content

(i.e., what the items were designed to measure). Examples of response styles are

acquiescence, disacquiescence, midpoint responding, extreme responding, noncontingent

responding, and socially desirable responding (e.g., Baumgartner & Steenkamp, 2001, 2006;

Paulhus, 1991; Van Herk, 2000). The acquiescence response style is defined as a general

preference for the agreement response categories of item scales, and the disacquiescence

response style is defined as a general preference for the disagreement response categories of

item scales. These two response styles may be investigated by means of control scales.

Theorists (e.g., Baumgartner & Steenkamp, 2001, 2006; Knowles & Nathan, 1997; Paulhus,

1991; Van Herk, 2000) suggested limiting the influence of these two response styles on the

measurement of a trait by simultaneously using items that are indicative of that trait and items

that are counter-indicative of that trait. Both kinds of items were included in the measurement

instrument of customer satisfaction (see Section 5). The extreme response style is defined as a

general preference for extreme response categories (i.e., the endpoints) of item scales, and the

midpoint response style is defined as a general preference for the middle response category of

item scales. These two response styles also may be investigated by means of control scales

(e.g., Baumgartner & Steenkamp, 2001, 2006; Bronner & Kuijlen, 2007; Greenleaf 1992a,

1992b). For example, control scales may be used to measure general midpoint responding and

general extreme responding, and the corresponding scores may be correlated with

measurements of the trait of interest in order to assess the influence of stylistic responding on

the measurement of the trait. Noncontingent responding refers to the tendency to respond

randomly to items. This response style may be investigated by means of multivariate outlier

analyses (e.g., Tabachnick & Fidell, 1997, pp. 74-75). Socially desirable responding refers to

the tendency of persons to make themselves look good by providing socially desirable

responses to the items. This response style may be investigated by means of control scales

(e.g., Paulhus, 1991).
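
As a simple illustration of how extreme and midpoint responding could be quantified on a control scale, the sketch below computes, per person, the proportion of responses in the endpoint categories and in the middle category. Using plain proportions is an assumption made for this example and is not necessarily how the control scales in this study were scored. The function names and the 0 to 4 scoring are likewise illustrative.

```python
def extreme_response_index(responses, low=0, high=4):
    """Proportion of responses in the endpoint categories of the item scale."""
    return sum(r in (low, high) for r in responses) / len(responses)


def midpoint_response_index(responses, low=0, high=4):
    """Proportion of responses in the middle category of the item scale."""
    middle = (low + high) / 2
    return sum(r == middle for r in responses) / len(responses)
```

A person answering `[0, 4, 2, 2]` on four hypothetical control items would obtain 0.5 on both indices. Such indices can subsequently be correlated with the trait scores to assess the influence of stylistic responding.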

Stylistic responding is a threat to validity of measurement. Messick (1991; also, see

Jackson & Messick, 1958) argued that stylistic responding is inversely related to the extent

that responses of persons to items are content-driven. This is an important stance. First, this

stance implies that stylistic responding is inhibited by optimising. Optimising (Krosnick,

1991, 1999) is response behaviour that is characterised by giving much consideration to the

accuracy of the responses. For example, when a person puts effort into understanding an item

and into providing the optimal response to the item, he or she is said to optimise (Krosnick,

1999, pp. 546-547). Second, this stance implies that stylistic responding is enhanced by

satisficing. Satisficing (Krosnick, 1991, 1999) is response behaviour that is characterised by

giving little consideration to the accuracy of the responses. For example, when a person does

not spend effort to generate the most accurate answer to a question but settles for a merely

satisfactory one, he or she is said to satisfice (Krosnick, 1999, p. 548). Third, Messick’s

(1991) stance implies that the conditions that enhance satisficing also enhance stylistic

responding. These conditions are (a) task difficulty, (b) persons’ abilities, and (c) persons’

motivation to optimise (Krosnick, 1999, p. 548).

It is beyond the scope of this study to assess the contamination of scale scores by all

method-related phenomena. For this reason, it was decided to start the study into effects of

these phenomena by addressing four issues that were important for further applications of the

instrument, and for satisfaction research in general. First, it was investigated whether the

location of satisfaction items in the questionnaire influenced satisfaction scores. Second, it

was investigated whether the presentation mode of response alternatives of satisfaction items

influenced satisfaction scores. Third, it was investigated whether persons’ positions on the

midpoint response style influenced satisfaction scores. Fourth, it was investigated whether

persons’ positions on the extreme response style influenced satisfaction scores.

The hypotheses

The expectations and questions with respect to construct representation and irrelevant

variance were formalised in a set of hypotheses. The hypotheses are listed in Table 4.

Table 4: List of Hypotheses

Explicit construct representation
H1    Customer satisfaction is manifested in various expressions that are mutually related but not sharply delineated
H2    The satisfaction items constitute a scale according to the MH model
H3    The satisfaction scores are positively related to other satisfaction scores

Implicit construct representation
H4    Satisfaction scores are positively related to trust scores
H5    Satisfaction scores are positively related to quality scores
H6    Satisfaction scores are positively related to loyalty scores
H7    Satisfaction scores are positively related to future customer profitability

Concept-related irrelevant variance
H8    The satisfaction scores are not contaminated by trust
H9    The satisfaction scores are not contaminated by quality
H10   The satisfaction scores are not contaminated by loyalty
H11   The satisfaction scores are not contaminated by current customer profitability

Method-related irrelevant variance
H12   The satisfaction scores are not affected by the location of items in the questionnaire
H13   The satisfaction scores are not affected by the presentation of the response categories of satisfaction items
H14   The satisfaction scores are not affected by the midpoint response style
H15   The satisfaction scores are not affected by the extreme response style

Chapter 5

Method of the first empirical study into customer satisfaction with BANK

1 Introduction

This chapter addresses the method of the first empirical study into customer satisfaction with

BANK. The chapter provides an outline of the operationalisations of the constructs, and the

construction of the questionnaire, the pre-tests, the pilot study, and the main study.

2 Operationalisations

Customer satisfaction

Customer satisfaction was operationalised by means of nine Likert items (Table 3, Chapter 4)

with five ordered response categories each, ranging from totally agree (which was scored 4)

to totally disagree (which was scored 0) (Table 1). The nine items were expected to constitute

a unidimensional scale after re-scoring the counter-indicative items (Chapter 4).

Table 1: Items Reflecting Customer Satisfaction with BANK

Code    Item                                                Aspect                 Score range
Q3a     At BANK I feel at home                              Affect                 0 – 4
Q3b     I am satisfied with BANK                            General satisfaction   0 – 4
Q3d*    There are good reasons to leave BANK                Cognition              0 – 4
Q3e*    I have mixed feelings about BANK                    Affect                 0 – 4
Q3g     BANK meets all my requirements for a bank           Need fulfilment        0 – 4
Q4a     Last year I had a pleasant relationship with BANK   Affect                 0 – 4
Q4b     BANK has met my expectations                        Disconfirmation        0 – 4
Q4c*    I have regretted my choice for BANK                 Regret                 0 – 4
Q4d*    Last year I had some problems with BANK             Cognition              0 – 4

* = item is counter-indicative of customer satisfaction with BANK
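
The re-scoring of the counter-indicative items and the computation of a total score can be sketched as follows. The item codes are those of Table 1. The data structure, the function names, and the example respondent are assumptions made for illustration.

```python
# Item codes of the counter-indicative items in Table 1.
COUNTER_INDICATIVE = {"Q3d", "Q3e", "Q4c", "Q4d"}
MAX_SCORE = 4  # item scores range from 0 (totally disagree) to 4 (totally agree)


def recode(responses):
    """Reverse the scores of the counter-indicative items, so that a higher
    score always indicates more satisfaction.

    responses: dict mapping item code to raw score (0-4).
    """
    return {code: (MAX_SCORE - score if code in COUNTER_INDICATIVE else score)
            for code, score in responses.items()}


def total_score(responses):
    """Total score: sum of the re-coded item scores."""
    return sum(recode(responses).values())


# Hypothetical respondent who totally agrees with all indicative items
# and totally disagrees with all counter-indicative items.
respondent = {"Q3a": 4, "Q3b": 4, "Q3d": 0, "Q3e": 0, "Q3g": 4,
              "Q4a": 4, "Q4b": 4, "Q4c": 0, "Q4d": 0}
print(total_score(respondent))
```

With nine items scored 0 to 4, the total score ranges from 0 to 36, and this hypothetical respondent obtains the maximum.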

American Customer Satisfaction Index

Customer satisfaction was also operationalised by means of a measurement instrument

adopted from the American Customer Satisfaction Index (ACSI; e.g., Fornell et al., 1996).

This instrument (Table 2) consisted of three items with ten ordered response categories each,

ranging from very negative (e.g., very dissatisfied, which was scored 0) to very positive (e.g.,

very satisfied, which was scored 9). The three items were expected to constitute a

unidimensional scale (see Chapter 3). The instrument is further denoted as the ACSI.

Table 2: American Customer Satisfaction Index

Code    Item                                                   Score range
Q20b    How satisfied are you with BANK?                       0 – 9
Q20c    To what extent does BANK meet your ideal of a bank?    0 – 9
Q20d    To what extent has BANK met your expectations?         0 – 9

Trust

Following Morgan and Hunt (1994), trust was defined as a person’s confidence in the

reliability and integrity of the company. On the basis of the definition of trust, a set of seven

Likert items was formulated. Each item had five ordered response categories that ranged from

totally agree (which was scored 4) to totally disagree (which was scored 0). Two items were

counter-indicative of trust, and covered distrust. The seven items are listed in Table 3.

In the context of retail banking, confidence in integrity and confidence in reliability are

intertwined (see also Chapter 3). Many expectations, such as the expectation that the company

will keep its promises and the expectation that the company will handle the banking matters

of a person properly, encompass both confidence in the reliability of the company and

confidence in the integrity of the company. Consequently, we expected the seven items to

constitute a unidimensional scale.

Table 3: Items Reflecting Trust

Code Item Aspect Score range

Q5a I can depend on BANK to treat me fairly Integrity 0 - 4

Q5b I can depend on BANK to handle my banking affairs correctly Both 0 - 4

Q5c I can depend on BANK to keep its promises Both 0 - 4

Q5d* I sometimes doubt the competence of BANK Reliability 0 - 4

Q5e* I sometimes doubt the good will of BANK Integrity 0 - 4

Q5f I can trust BANK Both 0 - 4

Q5g I can depend on BANK to serve me well Both 0 - 4

*= item is counter-indicative of trust

Quality

In Chapter 3, quality was defined as a person’s perception of the quality of attributes of

products and services provided by the company. This definition is in agreement with the

conception of quality as perceived quality, which implies that quality had to be measured by

means of a psychological measurement instrument. Because quality pertains to distinct

attributes of products and services provided by the company, we expected the instrument to

yield a multidimensional measurement of quality. Furthermore, we expected the combination

of a customer’s positions on these dimensions to drive customer satisfaction (Chapter 3,

Section 5).

Wirtz and Bateson (1995; also Wirtz 2000) demonstrated that halo effects influenced

several measurements of quality, meaning that responses to items about quality of attributes

of products or services provided by the company were influenced by general satisfaction with

the company. The occurrence of halo effects may have been enhanced by the

operationalisations of quality. To control for halo effects, we decided to operationalise quality

in two different, concrete, and detailed ways, which we hoped would stimulate the

respondent to contemplate the quality of distinct attributes of products and services

rather than provide an overall and perhaps too impressionistic evaluation.

First, quality was operationalised by means of a set of items regarding the experience of

problems with BANK in the preceding twelve months. A list of problems was compiled on

the basis of an inventory of customer complaints received by the company, and previous research

into drivers of customer satisfaction (e.g., Terpstra & Van Gastel, 2004). A total of 16

problems, thus defining 16 items, was included in the questionnaire (Table 4). Persons were

asked whether or not these problems had occurred to them in the preceding twelve months.

The response yes was scored 1, and the response no was scored 0. It was expected that the 16

items were not correlated or weakly correlated.

Second, quality was operationalised by means of a set of 24 items measuring

judgements about attributes of the products and services provided by the company (Table 5).

Each item had four ordered response categories that ranged from excellent (which was scored

3) to bad (which was scored 0). The set of attributes was compiled on the basis of previous

satisfaction research of the company (Terpstra & Van Gastel, 2004), and covered a broad

range of topics. Because it covered a broad range of topics, it was expected that the items

constituted multiple scales.

Table 4: Items Reflecting Quality. All Items are Counter-Indicative of Quality.

Code Problem Score range

Q6a Errors in the execution of your banking affairs 0 - 1

Q6b Errors in the execution of your orders 0 - 1

Q6c Insufficient information on your banking affairs 0 - 1

Q6d Ambiguous information on your banking affairs 0 - 1

Q6e Unfair costs of banking services 0 - 1

Q6f Slow service 0 - 1

Q6g Slow money transfers 0 - 1

Q6h Not keeping an appointment 0 - 1

Q6i Insufficient accessibility by telephone 0 - 1

Q6j Insufficient accessibility by Internet 0 - 1

Q6k Insufficient accessibility of offices 0 - 1

Q6l Insufficient response to questions 0 - 1

Q6m Problems with debit cards 0 - 1

Q6n Problems with cash withdrawals 0 - 1

Q6o Problems with internet banking 0 - 1

Q6p Another problem 0 - 1

Table 5: Items Reflecting Quality

Code Item Score range

Q7a Correct execution of orders 0 - 3

Q7b Speed of money transfers 0 - 3

Q7c Speed of service delivery 0 - 3

Q7d Adherence to promises 0 - 3

Q7e Correct execution of banking matters 0 - 3

Q7f Distribution of bank statements 0 - 3

Q8a Costs of accounts of the company 0 - 3

Q8b Convenience of products and services 0 - 3

Q8c Clarity of information provided 0 - 3

Q8d Sufficiency of information provided 0 - 3

Q8e Costs of services of the company 0 - 3

Q8f Interest rates of the company 0 - 3

Q9a Service by telephone 0 - 3

Q9b Service by the Internet 0 - 3

Q9c Service by bank offices 0 - 3

Q9d Service by mail correspondence 0 - 3

Q9e Accessibility of the company 0 - 3

Q9f Facilities for Internet banking 0 - 3

Q10a Friendliness of employees 0 - 3

Q10b Capability of employees 0 - 3

Q10c Reliability of employees 0 - 3

Q10d Openness for questions 0 - 3

Q10e Responsiveness of the company 0 - 3

Q10f Handling of complaints 0 - 3

Customer loyalty

Following Gremler and Brown (1996, 1999), customer loyalty was defined as the degree to

which a customer is doing repeat business with the company, possesses a positive attitudinal

disposition towards the provider, and considers only this provider when a need for this

service arises. According to this definition, customer loyalty encompasses (a) cognitions,

affects, and behaviour with respect to the company, and (b) a comparison of the company

with other firms. On the basis of this definition, a set of six Likert items was constructed to

operationalise customer loyalty (Table 6). Each item reflected a particular aspect of customer

loyalty (i.e., cognition, affect, or past behaviour), and had five ordered response categories

ranging from totally agree (which was scored 4) to totally disagree (which was scored 0). In

accordance with former studies using similar measurement instruments of customer loyalty

(e.g., Caruana, 2002; Gremler & Brown, 1999), we expected the six items to constitute a

unidimensional scale.

Table 6: Items Reflecting Customer Loyalty

Code Item Aspect Score range

Q14a If I need new financial products, BANK is my first choice Cognition 0 – 4

Q14b I have more sympathy for BANK than for other banks Affect 0 – 4

Q14c* For some matters I am better off with another bank Cognition 0 – 4

Q14d* I consider switching from BANK to another bank Cognition 0 – 4

Q14e BANK offers me benefits other banks don’t offer Cognition 0 – 4

Q14f For many years BANK has been my primary bank Behaviour 0 – 4

* = item is counter-indicative of customer loyalty

Customer profitability

In Chapter 3, customer profitability (CP) was defined as the gross financial contribution of a

customer to a company in a specified period of time. Because a long time period is less

subject to behavioural anomalies than a short time period (Mulhern, 1999), we chose a time

period of a year for the measurement of CP. Thus, CP at time t was the gross financial

contribution of a customer to a company in the twelve months preceding time t.

CP consisted of interest profits and provision profits. Interest profits and provision

profits were a function of the balances held or the provisions paid by a customer on the one

hand, and the corresponding gross margins of the company on the other hand (the gross

margins are the margins of the company before the costs for servicing the customer, such as

transaction costs, contact costs, marketing costs, and overhead costs, are accounted for; see

for example Cooper & Kaplan, 1991, p. 469). For example, if a customer held a credit balance of 1000 euro

during one month, and the company’s gross margin on 1 euro of credit balance

was 0.002 euro per month, the interest profits yielded by the customer in that month were equal to 2 euro.

The sum of all profits from a customer over the 12 months preceding time t was labelled

CP at time t.
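
The arithmetic of this definition can be illustrated with a minimal sketch. The function names are hypothetical and do not refer to the company's actual systems; the twelve identical months in the second step are an assumption made purely for illustration.

```python
def monthly_interest_profit(balance_eur, gross_margin_per_eur):
    """Interest profit for one month: balance held times gross margin per euro."""
    return balance_eur * gross_margin_per_eur


def customer_profitability(monthly_profits):
    """CP at time t: sum of the gross profits over the 12 months preceding t."""
    if len(monthly_profits) != 12:
        raise ValueError("expected exactly 12 monthly profit figures")
    return sum(monthly_profits)


# Example from the text: 1000 euro credit balance, margin of 0.002 euro per euro per month.
profit_one_month = monthly_interest_profit(1000, 0.002)

# Hypothetical customer who holds the same balance in each of the 12 months.
cp_at_t = customer_profitability([profit_one_month] * 12)
```

In practice the monthly figures would differ and would also include provision profits, which are summed in the same way.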

Three additional remarks are in order. First, CP at time t was computed monthly by the

company, and expressed in euro. The CP-figures from September 2005, September 2006, and

September 2007 were collected from the internal databases of the company (Section 6 of the

present chapter). Second, if an account (e.g., a mortgage) was held by two or more customers,

one of these customers was registered by the company as the primary owner of the product.

Only the accounts for which the customer was registered as the primary owner were included

in the calculation of profitability of the customer. Third, if a customer left the company, the

customer did not generate any profits from that month onwards, and after a year the profits

generated by this customer in the preceding twelve months were reduced to zero. The

company registered this as a missing value on CP at time t, but in this study this missing value

actually represents zero profits.

Interest

Interest was measured in order to test the quality of the survey data by means of correlating

items reflecting customer satisfaction and items reflecting interest (to be discussed in Section

2 of Chapter 6). We expected that items reflecting customer satisfaction were uncorrelated

with items reflecting interest, and a different result would raise suspicion about the quality of

the survey data. A customer’s interest in banking matters was operationalised on the basis of

two items (Table 7). Each item had five ordered response categories that ranged from highly

interested (which was scored 4) to not interested (which was scored 0). We expected the

items to be positively correlated.

Table 7: Items Reflecting Interest

Code Item Score range

Q17 How interested are you in banking matters? 0 - 4

Q18 How interested are you in the development of new products and services by banks? 0 - 4

3 The questionnaire

The questionnaire (Appendix 1; in Dutch) was composed of the items reflecting customer

satisfaction (represented by two item sets), trust, quality (also represented by two item sets),

customer loyalty, and interest. In addition, some items were included in the questionnaire for

business purposes, and some other items were included to optimise the design of the

questionnaire. For example, some items regarding product ownership and contacts with the

company were included in order to elicit the participant’s memories of the company before

the measurement of satisfaction with the company started. Furthermore, some items regarding

relations of the participant with other providers of financial services were included in order to

elicit his or her memories of other providers of financial services before proceeding with the

measurement of loyalty with the company.

The design of the questionnaire, the format of the items, and the wording of the items

were based upon general principles concerning survey research (see, e.g., Belson, 1986;

Dillman, Tortora, & Bowker, 1998; Sheatsley, 1983; Sudman & Bradburn, 1982). An

important issue was the inclusion of the no answer option among the response options of the

items. It is well known that items allowing respondents to use a no answer option may

cause problems in data analysis (e.g., Tabachnick & Fidell, 2007, pp. 62-63), and that a no

answer option may invoke satisficing (e.g., Krosnick, 1999; Krosnick & Fabrigar, 1997).

Nevertheless, for four reasons it was decided to maintain the no answer option of

items:

(a) Interviews with participants after they had taken pre-tests of the questionnaire

revealed that they appreciated the no answer option. They claimed that they could not

answer particular items if they had no experience with the subject. An example of

such an item concerned the handling of complaints by the company. It was

considered useful to include these items in the questionnaire, in particular to collect

data on the seriousness of a particular problem.

(b) To limit the risk of satisficing (Krosnick, 1999), the item texts were kept short,

simple, and concrete in order to limit the difficulty of the participants’ task and

prevent participants from taking the easy way out in answering the items, thus using the

no answer option too lightly;

(c) A pilot study (to be discussed in Section 5) demonstrated that the no answer option

was rarely used with respect to the satisfaction items. The response option apparently

did not invoke satisficing on this subject; and

(d) A practical reason for using the no answer option was that the questionnaire was to

be administered via the Internet. The administration mode encompassed a forcing

mechanism that required the participant to respond to an item before proceeding to

the next item. Such a mechanism may contaminate the data, because a participant

may have good reasons not to answer a particular question (Dillman et al., 1998).

Thus, the no answer option was also meant to neutralise the forcing mechanism.

The ordering of items within a block of items, such as the items within block Q3

(Appendix 1), was different across different administrations of the questionnaire. The effect

of the location of the satisfaction items (Q3, Q4, and Q20; Appendix 1) on the scale scores

was assessed in the pilot study. The objective of these measures was to test and to control for

order effects.

The questionnaire was improved by means of qualitative pre-tests among 10 persons and

a pilot study among 372 persons. The pre-tests (to be discussed in Section 4) demonstrated

that it took 15 to 35 minutes for participants to complete the questionnaire. We considered

this rather long and suspected this might demoralise participants, and stimulate satisficing

(e.g., Krosnick, 1999, pp. 248-249; Sheatsley, 1983, p. 223; Sudman & Bradburn, 1982, p.

262). In order to motivate participants to complete the questionnaire, we explained the

purpose of the study in the E-mail (Appendix 2; in Dutch) by which they were invited to

participate in the survey.

4 The pre-tests

The questionnaire was pre-tested between February 2005 and May 2005, by means of in-depth

interviews with mature customers of BANK. The first objective of the pre-tests was to test

how long it took participants to complete the questionnaire and to explore participants’

interpretations of the items in the questionnaire. The second objective of the pre-tests was to

test the first hypothesis of the empirical study (i.e., customer satisfaction is manifested in

various expressions that are mutually related but not sharply delineated; see Section 7 in

Chapter 4). The results of the pre-tests were used to improve the wording of the items and the

design of the questionnaire, before executing the pilot study and the main study. Furthermore,

the results were used to test the first hypothesis.

Target population

The target population of this study consisted of the mature customers of a Dutch retail bank.

These were adults who were registered by the company as the primary owner of at least one

banking product provided by the company.

Sample

The sample was composed of ten mature customers of the bank. Four were male and six were

female. Their ages ranged from 29 to 71 years. Their education ranged from professional

to academic. None of the participants was employed in consumer research or the financial services

industry.

Procedure

The questionnaire was presented in paper-and-pencil format to the participant. The participant

filled out the questionnaire, and the interviewer registered the time it took to complete the

questionnaire. Afterwards, the interviewer interviewed the participant. The participant was

asked about his or her satisfaction with the company, about the meaning that he or she attached

to satisfaction with a retail bank, and about the answers he or she had given to the survey items.

The responses were registered on paper by the interviewer.

The interviewer’s notes about the time span of the survey and the responses of participants to

the post-survey interview constituted the raw data.

5 The pilot study

The pilot survey was conducted in August 2005, among mature customers of the bank. The

first objective of the pilot study was to test the procedure of the survey. It was assessed (a)

how many participants completed the questionnaire, (b) how often missing values on items

occurred, and (c) what kind of comments the participants made with respect to the

questionnaire. The second objective was to test the hypotheses 12 and 13 (the hypotheses

regarding the effect of (a) location of satisfaction items and (b) ordering of response

categories on scale scores; see Section 7 in Chapter 4). The results of the pilot study were

used to decide on technical properties of the main survey, and to test hypotheses 12 and 13.

Design

Four versions of the questionnaire were administered that differed with respect to the location

of the satisfaction items in the survey, and the ordering of the response categories of the

satisfaction items. On the basis of this design (Table 8) it was tested whether (a) the location

of the satisfaction items in the questionnaire, and (b) the ordering of the response categories

of satisfaction items, had an effect on the average satisfaction scores.

Table 8: Design of the Pilot Study

Survey version    Location of items    Ordering of categories    N

1                 A                    A                         90

2                 A                    B                         95

3                 B                    A                         89

4                 B                    B                         98

The location of the satisfaction items refers to the location of Q3, Q4 and Q20

(Appendix 1) in the questionnaire. In the survey versions 3 and 4, the locations of Q3 and Q4

on the one hand and Q20 on the other hand were reversed. The order of response categories

refers to the response categories of the Likert items, which were totally agree – agree –

neutral – disagree – totally disagree. In the survey versions 2 and 4, the response categories

were displayed in reversed order.

Target population

Sample

The sample was drawn from the research panel of the company. This panel was composed of

a total of 3984 mature customers of the company who had agreed to participate in marketing

research via the Internet. The agreement stipulated that (a) the company is free to

approach the person for marketing research, (b) the person is free to participate in the research

or to decline, (c) the company is allowed to use the survey data for research purposes only,

and (d) the company is not allowed to distribute any personalised data to third parties. All

panel members could be approached by E-mail, and had a unique customer-id that was used

for identification purposes.

The reasons for using the research panel for this study were (a) its considerable size, (b)

its facilities for Internet research, and (c) the availability of a customer-id for each panel

member. The customer-id facilitated the enrichment of the survey data with the company data

that were needed in this study. The arguments in favour of the use of the research panel

outweighed the argument against the panel, which was the possibility that the panel might be

biased with respect to some psychological characteristics. For example, it cannot be ruled out

that (a) persons who were willing to participate in the panel had a different attitude towards

banking than persons who were not willing to participate in the panel, and (b) persons who

had access to the Internet had different psychological characteristics than persons who did not

have access to the Internet. Thus, the choice to use the research panel may have increased

coverage error (i.e., error that arises because different units in the target population have

different probabilities of being included in the sample; e.g., Dillman & Bowker, 2001;

Groves, 1989).

Three additional remarks with respect to the research panel are in order. First, the

variable customer segment refers to a segmentation which reflects the value of the customers

to the company, and which was used by the company for marketing purposes. The company

distinguished three segments, which were Top Customers, Standard Customers, and

Development Customers. Each customer of the company, except those who were not

registered as the primary owner of a product provided by the company, was assigned to

one and only one of these segments. Because the company’s most valuable customers (i.e.,

Top Customers) were overrepresented in the research panel, the panel differed significantly

(χ²(2) = 1270, p < 0.001) from the target population with respect to the distribution of

customer segment (Table 9). Second, the panel differed significantly (χ²(2) = 324, p < 0.001)

from the target population with respect to the distribution of gender. Males were

overrepresented in the panel (Table 9). This was partly due to the overrepresentation of males

in the segment Top Customers (i.e., the segment that was overrepresented in the research

panel), and partly to unknown causes. Third, the panel differed significantly (χ²(2) = 299, p <

0.001) from the target population with respect to the distribution of age group (Table 9). The

average age in the panel was 47 years, and in the target population it was 48 years. The

average age in the target population appears to be high, but this is because only adults

constituted this population.
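
How such a comparison between the panel and the target population can be computed is sketched below. The sketch assumes a chi-square goodness-of-fit test in which the population distribution supplies the expected proportions; the text does not state which variant of the test was used. The function name is hypothetical, and the counts are reconstructed from the rounded percentages for customer segment in Table 9 and the panel size of 3984, so the outcome is close to, but not identical with, the reported value of 1270.

```python
def chi_square_gof(observed_counts, expected_proportions):
    """Chi-square goodness-of-fit statistic and its degrees of freedom."""
    total = sum(observed_counts)
    statistic = 0.0
    for observed, proportion in zip(observed_counts, expected_proportions):
        expected = total * proportion
        statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = len(observed_counts) - 1
    return statistic, degrees_of_freedom


PANEL_SIZE = 3984

# Rounded shares for Top, Standard, and Development customers (Table 9).
company_share = [0.30, 0.44, 0.26]
panel_share = [0.56, 0.32, 0.12]

# Approximate panel counts reconstructed from the rounded percentages (assumption).
panel_counts = [PANEL_SIZE * share for share in panel_share]

statistic, degrees_of_freedom = chi_square_gof(panel_counts, company_share)
```

With three categories the test has two degrees of freedom, which agrees with the χ²(2) notation used in the text.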

In total, 800 persons were invited to participate in the survey. These persons were

selected randomly from the research panel. The response rate in the pilot study was

approximately 47% (N = 372), and the participants were distributed more or less evenly

across the four versions of the questionnaire (Table 8). The distributions of customer segment,

gender, and age group within the company, the panel, and the sample, respectively, are

reported in Table 9.

In line with our expectations, the sample differed significantly from the target

population with respect to customer segment (χ²(2) = 209, p < 0.001), gender (χ²(2) = 42, p <

0.001), and age group (χ²(2) = 35, p < 0.001). Furthermore, the sample differed significantly

from the panel with respect to customer segment (χ²(2) = 16.91, p < 0.001). Thus, respondents

differed significantly from non-respondents with respect to customer segment. The sample

was representative of the panel with respect to gender and age group.

Table 9: Distribution (Percentages) of Customer Segment, Gender, and Age Group in the Pilot Study

                        Company    Panel    Sample

Customer segment

  Top                   30         56       64

  Standard              44         32       30

  Development           26         12       6

Gender

  Female                44         31       30

  Male                  52         66       68

  Unknown               4          3        2

Age group

  18 to 39 years        35         30       28

  40 to 59 years        38         51       52

  60 years and older    27         19       20

Procedure

The survey was administered via the Internet. Persons were invited by E-mail to participate in

the survey. The questionnaire was made available at a site of the marketing research agency

that managed the survey. The questionnaire was accessible from 19 August 2005 until 4

September 2005. Persons had access to the site on the basis of a password and were identified

on the basis of a customer-id. After a participant completed the questionnaire, the data were

uploaded to the agency. The participants received a small incentive (i.e., savings points worth

10 euro). This was the customary fee that the company paid to panel members who responded to a

survey of medium length.

The research agency delivered a file containing the raw data, which were the coded responses of

the participants to the survey items (the research agency scored a no answer response as a

missing value). In order to enrich the raw data, the file was merged with the marketing

database. The merging was executed on the basis of customer-id, and it was successful for all

participants. Subsequently, three variables were added to the file, which were (a) customer

segment ultimo September 2005, (b) gender, and (c) age ultimo September 2005.
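
The enrichment step can be illustrated with a minimal sketch of a merge on customer-id. The study itself used the company's marketing database and SAS; the record layout, the function name, and the example values below are hypothetical.

```python
def enrich(survey_rows, marketing_rows, key="customer_id"):
    """Attach marketing-database fields to survey records that share the same key."""
    lookup = {row[key]: row for row in marketing_rows}
    merged, unmatched = [], []
    for row in survey_rows:
        extra = lookup.get(row[key])
        if extra is None:
            # Keep track of participants that cannot be found in the database.
            unmatched.append(row[key])
            continue
        merged.append({**row, **extra})
    return merged, unmatched


# Hypothetical records; None represents a no answer response scored as missing.
survey = [
    {"customer_id": 101, "Q20b": 8},
    {"customer_id": 102, "Q20b": None},
]
marketing = [
    {"customer_id": 101, "segment": "Top", "gender": "F", "age": 45},
    {"customer_id": 102, "segment": "Standard", "gender": "M", "age": 61},
]

merged, unmatched = enrich(survey, marketing)
```

An empty list of unmatched ids corresponds to the situation described in the text, in which the merge was successful for all participants.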

6 The main study

The main survey was conducted in October 2005, among mature customers of the bank. The

study was used to construct the measurements of the constructs, and to test the hypotheses

(see Section 7 in Chapter 4).

Target population

Sample

A total of 3612 persons were invited to participate in the survey. They were the remainder of

the research panel of the company (i.e., the part of the panel that did not participate in the

pilot study). The response rate in the main study was approximately 47% (N = 1689). The

distributions of customer segment, gender, and age group within the company,

the remainder of the panel, and the sample, respectively, are reported in Table 10.

In line with our expectations, the sample differed significantly from the target

population with respect to customer segment (χ²(2) = 813, p < 0.001), gender (χ²(2) = 183, p <

0.001), and age group (χ²(2) = 157, p < 0.001). Furthermore, the sample differed significantly

from the remainder of the panel with respect to customer segment (χ²(2) = 75, p < 0.001),

gender (χ²(2) = 9.95, p < 0.01), and age group (χ²(2) = 8.85, p < 0.05). Thus, respondents

differed significantly from non-respondents with respect to customer segment, gender, and

age group. For gender and age the absolute differences were very small, and for practical

purposes they may be ignored.

Table 10: Distributions (Percentages) of Customer Segment, Gender, and Age Group in the Main Study

                        Company    Remainder of panel    Sample

Customer segment

  Top                   30         55                    61

  Standard              44         32                    30

  Development           26         13                    9

Gender

  Female                44         31                    30

  Male                  52         66                    68

  Unknown               4          3                     2

Age group

  18 to 39 years        35         30                    28

  40 to 59 years        38         51                    52

Procedure

The survey was administered via the Internet. Persons were invited by E-mail to participate in

the survey. The questionnaire was made available at a site of the marketing research agency

that managed the survey. The questionnaire was accessible from 30 September 2005 until 16

October 2005. Persons had access to the site on the basis of a password and were identified on

the basis of a customer-id. After a participant completed the questionnaire, the data were

uploaded to the agency. The participants received a small incentive (i.e., savings points worth

10 euro). This was the customary fee that the company paid to panel members who responded to a

survey of medium length.

The research agency delivered a file containing the raw data, which were the coded responses of

the participants to the survey items (again, a no answer response was scored as a missing

value). In order to enrich the raw data, the file was merged with the marketing database. The

merging was executed on the basis of customer-id, and it was successful for all participants.

Subsequently, seven variables were added to the file, which were (a) customer segment ultimo

September 2005, (b) gender, (c) age ultimo September 2005, (d) CP ultimo September 2005,

(e) CP ultimo September 2006, (f) CP ultimo September 2007, and (g) an indicator of whether the

customer had died between September 2005 and September 2007.

Chapter 6

Results of the first empirical study into customer satisfaction with BANK

1 Introduction

This chapter addresses the results of the first empirical study into customer satisfaction with

BANK. First, the preliminary analyses are discussed. The purpose of these analyses was to

examine the data quality and to prepare the data for the subsequent analyses. Second, the

measurement analyses are discussed. The purpose of these analyses was to construct the

scales of customer satisfaction, trust, quality, and customer loyalty. Third, the tests of the

hypotheses explained in more detail in Chapter 4 are discussed. The purpose of these tests

was to collect empirical evidence regarding the validity of measurement of customer

satisfaction. Fourth, additional research into the relation between customer satisfaction and

future customer profitability (future CP) is discussed. The purpose of these analyses was to

explore this relation in more detail than we did for the tests of the hypotheses. Fifth, the

implications of the results of the empirical study are addressed. The discussion includes the

assessment of the strengths and weaknesses of the customer satisfaction scale. Sixth, the

conclusions of the study are presented.

2 Preliminary analyses

Method

This section addresses the preliminary analyses of the raw data from the pre-tests, the pilot

study and the main study.

Pre-test data

First, the data from the pre-tests were analysed. The interviewer reproduced the interviews

verbatim on the basis of the notes he made during the interview. The report of each interview

included (a) the registration of the time the participant took to complete the survey, (b) the

participant’s explanation of his or her satisfaction with the retail bank, and (c) the

participant’s comments on the survey and the questionnaire items.

Pilot study data

Second, the data from the pilot study were analysed. For this purpose, the dataset containing

the raw data was converted into a SAS dataset, and the items that were assumed to be counter-

indicative of the constructs (see the description of the measurement instruments in Chapter 5)

were recoded in the opposite direction. In order to get an impression of the distribution

characteristics of the variables, histograms and descriptive statistics of all variables in the

dataset were computed and examined. For this purpose, proc univariate (SAS STAT) and

proc means (SAS STAT) were used.
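
The recoding of counter-indicative items can be sketched as follows. The sketch assumes the usual rule for items whose lowest score is 0, namely that the recoded score equals the maximum score minus the observed score; the text does not spell out the rule. The analyses themselves were run in SAS; the Python function names are hypothetical, and the item codes are the starred items in the tables of Chapter 5.

```python
# Counter-indicative items and their maximum scores (all scored 0 to 4).
COUNTER_INDICATIVE = {
    "Q3d": 4, "Q3e": 4, "Q4c": 4, "Q4d": 4,  # customer satisfaction
    "Q5d": 4, "Q5e": 4,                      # trust
    "Q14c": 4, "Q14d": 4,                    # customer loyalty
}


def reverse_score(score, max_score):
    """Recode a score in the opposite direction; missing scores stay missing."""
    if score is None:
        return None
    return max_score - score


def recode(response):
    """Return a copy of one participant's responses with counter-indicative items reversed."""
    return {
        item: reverse_score(score, COUNTER_INDICATIVE[item])
        if item in COUNTER_INDICATIVE else score
        for item, score in response.items()
    }


# Hypothetical participant: totally agrees with Q5a and with the distrust item Q5d.
recoded = recode({"Q5a": 4, "Q5d": 4, "Q14c": None})
```

After recoding, a high score on every item indicates a high position on the construct, so that item scores can be meaningfully combined.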

In order to test the data quality, the correlations between the items reflecting customer

satisfaction with the retail bank and the items reflecting interest in banking matters were

examined. For this purpose, proc corr (SAS STAT) was used. It was expected that (a) the

items reflecting satisfaction were highly correlated, (b) the items reflecting interest were

highly correlated, and (c) the items reflecting satisfaction and the items reflecting interest

were uncorrelated.
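
The correlational check can be illustrated with a minimal sketch of the product-moment correlation. The sketch assumes that pairs with a missing score are left out (pairwise deletion); the function name and the example scores are hypothetical and are not taken from the survey data.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation, ignoring pairs in which either score is missing."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    ss_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    ss_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores of five participants on the two interest items (0 to 4).
interest_general = [4, 1, 3, 2, 0]
interest_products = [4, 2, 3, 2, 1]

r_interest = pearson_r(interest_general, interest_products)
```

The same function would be applied to pairs of satisfaction items and to pairs consisting of one satisfaction item and one interest item, after which the three sets of correlations can be compared with the expectations (a) to (c).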

Missing data may hamper the data analyses (e.g., Tabachnick & Fidell, 2007, p. 62).

Item-score imputation is a method for handling missing item scores in multiple-item

questionnaires. Suppose the score of participant p on item i is missing. Then, the imputation

of an item score based on the observed part of the data for participant p and item i, to be

discussed shortly in more detail, is an effective and simple way to complete the data matrix

and not lose a large part of the sample, as with the popular missing data handling by means of

listwise deletion.

In the statistical literature (e.g., Little & Rubin, 2002; Schafer & Graham, 2002), it is

well known that the way in which missing data have to be handled depends on the mechanism

that underlies the missingness. This mechanism often is difficult to identify once the missing-

data problem has presented itself, and this complicates adequate missing-data handling in

much empirical research. For item-score missingness in multiple-item questionnaires, in

which multiple items are used to measure one underlying construct such as satisfaction,

Bernaards and Sijtsma (2000) and Van Ginkel, Van der Ark, and Sijtsma (2007) found that

imputation of item scores has little or no biasing effect on outcomes of statistical analyses

when the percentage of missing item scores in the data matrix does not exceed, say, 15

percent. Serious bias is absent even when the missingness mechanism cannot be ignored in

the sense that the missing item scores cannot be considered a random sample from the

complete data matrix. The explanation for this robustness is that the available data contain

much information on the underlying construct, and thus are well able to compensate for the

non-randomness of the missing data. Because in the pilot study and the main study the total

percentage of missing item scores did not exceed 15, item-score imputation could be used

safely (results discussed in the next section).

For the imputation of item scores, we used two-way imputation with normally

distributed errors (abbreviated method TW-E; e.g., Bernaards & Sijtsma, 2000; Sijtsma & van

der Ark, 2003; Van Ginkel, 2007; Van Ginkel, Van der Ark, & Sijtsma, 2007). Van Ginkel

(2007) demonstrated that this method yielded nearly unbiased results in important

psychometric quantities such as Cronbach’s alpha. Method TW-E is suited in particular for

item sets that measure one construct. Let the score of person p on item i be missing. In two-

way imputation, a real value TWpi is estimated on the basis of (a) the mean of person p’s

available scores on the other items of the scale (i.e., the person mean PMp), (b) the mean of

the available scores of the other persons in the sample on item i (i.e., the item mean IMi), and

(c) the mean of all available scores of all persons in the sample on all items which constitute

the scale (i.e., the overall mean OM), so that

TWpi = PMp + IMi – OM.

In two-way imputation with normally distributed errors, a random error εpi is added to TWpi,

so that

TWpi(E) = TWpi + εpi.

The random error is drawn from a normal distribution with zero mean and variance σε².

Variance σε² is obtained from the squared differences between the observed scores Xpi in the

data matrix and the expected scores TWpi computed for the same cells. If TWpi(E) is

not an integer, it is rounded to the nearest integer within the range of feasible item scores, and

this rounded value is imputed in cell (p,i) of the data matrix.
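The steps of method TW-E described above can be sketched in Python. This is an illustration only and not the software used in the study. The function name, the use of None for a missing score, the divisor (number of observed scores minus one) for the error variance, and the clipping of rounded values to the feasible range are assumptions made for the sketch. Each item is assumed to have at least one observed score.

```python
import math
import random


def two_way_impute(data, min_score, max_score, seed=0):
    """Impute missing item scores (None) as TWpi + error, rounded and clipped.

    data: list of persons, each a list of item scores (int) or None.
    """
    rng = random.Random(seed)
    n_items = len(data[0])

    def mean(values):
        values = [v for v in values if v is not None]
        return sum(values) / len(values) if values else None

    observed = [(p, i) for p, row in enumerate(data)
                for i, x in enumerate(row) if x is not None]
    overall_mean = sum(data[p][i] for p, i in observed) / len(observed)  # OM
    person_means = [mean(row) for row in data]                           # PMp
    item_means = [mean([row[i] for row in data]) for i in range(n_items)]  # IMi

    def tw(p, i):
        return person_means[p] + item_means[i] - overall_mean

    # Error variance from squared differences between observed and expected scores
    # (divisor is an assumption of this sketch).
    residual_ss = sum((data[p][i] - tw(p, i)) ** 2 for p, i in observed)
    error_sd = math.sqrt(residual_ss / (len(observed) - 1))

    completed = [list(row) for row in data]
    for p, row in enumerate(data):
        if person_means[p] is None:
            continue  # no item answered on this scale: nothing is imputed
        for i, x in enumerate(row):
            if x is None:
                value = tw(p, i) + rng.gauss(0.0, error_sd)
                rounded = math.floor(value + 0.5)  # round to nearest integer
                completed[p][i] = min(max_score, max(min_score, rounded))
    return completed
```

A person who answered none of the items of a scale keeps all scores missing, in line with the requirement that at least one item must be answered.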

Method TW-E requires that at least one item from the item set reflecting a construct is

answered by the participant. Otherwise, the person mean PMp, and consequently TWpi, cannot

be computed. Thus, no values were imputed for missing scores of participants who did not

answer at least one item from a particular scale.

We excluded participants with missing scores from particular analyses if it was plausible

that missing scores on an item were due to the item being non-applicable for these

participants. For example, missing scores on an item addressing quality of complaint handling

by the company (i.e., item Q10f; see Table 5 in Chapter 5) may be due to the item being non-

applicable for participants who never had a complaint about the company. Because it is

unrealistic to impute a score for a missing value that indicates that an item may not be

applicable for a participant, we did not impute values for these missing scores but rather

excluded this case from the analysis (also, see Chapter 5, in which the decision to include the

no answer option was discussed). We excluded variables from the dataset if it was suspected

that (a) the missingness was nonignorable, (b) there were no substantive arguments for the

imputation of the missing scores, and (c) the variables were considered to be dispensable for

the study. For example, two variables reflecting customer loyalty were deleted from the

dataset because of these reasons (to be further discussed in the Section Results).

Main study data

Third, the data from the main study were analysed, similar to the data from the pilot study.

These analyses included (a) the recoding of items that were assumed to be counter-indicative

of the construct of interest, (b) the examination of distribution characteristics of the variables

in the dataset, (c) the imputation of missing values, and (d) the examination of correlations

between items reflecting satisfaction and items reflecting interest in banking matters.

Furthermore, in the main study (but not in the pilot study) a weighting factor containing

weights for persons in the dataset was computed, and outlier analyses were done.
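As an illustration of step (a), the recoding of a counter-indicative item can be sketched as follows. The score range of 0 to 4 is an assumption of this sketch, as are the function name and the use of None for a missing score.

```python
def reverse_score(score, min_score=0, max_score=4):
    """Recode a counter-indicative item so that a high value indicates the construct."""
    if score is None:
        return None  # a missing score stays missing
    return min_score + max_score - score
```

With this recoding, the lowest feasible score becomes the highest and vice versa, and the midpoint of the scale is unchanged.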

In Chapter 5, it was demonstrated that the sample differed significantly from the target

population with respect to customer segment, gender, and age group. The analyses

demonstrated that the difference with respect to customer segment between the sample and

the target population was larger than the differences with respect to gender and age group.

Because in-company research demonstrated that customer segment is an important variable in

customer profitability analyses (e.g., Terpstra, 2005) and because we intended to analyse the

relation between customer satisfaction and customer profitability, we decided to weight

participants in order to obtain proportional representation of customer segments in the sample.

Hox (1998) advocated weighting of persons if the sample is biased, and comparing the results

from statistical analyses with and without weighting. Following Hox (1998), we compared the

results of the analyses regarding the relation between customer satisfaction and future

customer profitability, with and without the weighting (Section 4). The weights of the

participants belonging to a particular customer segment were computed as the ratio between

the proportion of the customer segment in the company population and the proportion of the

customer segment in the sample. This means that the participants belonging to a customer

segment that was overrepresented in the sample were given a smaller weight than the

participants belonging to a segment that was underrepresented in the sample.
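The computation of the weights can be sketched as follows, using the segment proportions reported in Table 8. The function and variable names are hypothetical and serve only this illustration.

```python
def segment_weights(population_share, sample_share):
    """Weight per segment: proportion in the population / proportion in the sample."""
    return {segment: population_share[segment] / sample_share[segment]
            for segment in population_share}


# Proportions taken from Table 8
population = {"Top": 0.30, "Standard": 0.44, "Development": 0.26}
sample = {"Top": 0.61, "Standard": 0.30, "Development": 0.09}
weights = segment_weights(population, sample)
```

The overrepresented segment (Top) receives a weight smaller than 1 and the underrepresented segments receive weights larger than 1, so that the weighted sample proportions equal the population proportions.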

Univariate and multivariate outlier analyses were conducted to find cases that may

hamper the data analyses (e.g., Tabachnick & Fidell, 2007, pp. 72-77). For the detection of

univariate outliers, the histograms of variables were examined. For the detection of

multivariate outliers, the distances of persons to the centroid of the multivariate space defined

by the items in the dataset were examined. These distances can be expressed by the

Mahalanobis Distance (Mahalanobis, 1936) and by the leverage statistic, which is a function

of the Mahalanobis Distance (Tabachnick & Fidell, 2005, pp. 74-75). Let MD denote the

Mahalanobis Distance of person p and N the sample size. Then the leverage of person p,

denoted hpp, is defined as:

hpp = MD / (N - 1) + 1 / N.

We chose the leverage statistic for the detection of multivariate outliers, because this statistic

is readily available in SAS. Following Tabachnick & Fidell (2007, pp. 74-75, 111-112),

regression analysis was used to calculate the leverage statistic. This was done using several

items that reflected different constructs as predictors and customer-id as criterion (because the

leverage statistic expresses the distances of persons to the centroid of the multivariate space

defined by the predictor variables in the regression analysis, the choice of the criterion

variable in the regression analysis is unimportant). Persons with a significant value for

leverage (p < 0.001) were defined as multivariate outliers, and their score patterns were

visually examined to find out what caused the high leverage value. The outliers were marked

in the dataset by an indicator variable. Furthermore, for each participant the proportion of

missing values on each set of items constituting a measurement instrument was computed. If

this proportion exceeded 0.5, a participant was marked as an outlier. To evaluate the impact of

outliers on the results, we did all analyses on the dataset including the outliers (i.e., the

complete dataset) and on the dataset without outliers (i.e., the reduced dataset).
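The relation between the Mahalanobis Distance and the leverage statistic, hpp = MD / (N - 1) + 1 / N, can be illustrated with a small sketch for two variables. The sketch assumes that MD is the squared distance to the centroid based on the sample covariance matrix (divisor N - 1). The function name and the restriction to two variables are choices made for the illustration; in the study the statistic was obtained from a regression analysis in SAS.

```python
def leverage_two_variables(x, y):
    """Leverage per person computed from the squared Mahalanobis distance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sample variances and covariance (divisor n - 1)
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy ** 2
    leverages = []
    for a, b in zip(x, y):
        dx, dy = a - mx, b - my
        # Quadratic form with the inverse of the 2x2 covariance matrix
        md = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        leverages.append(md / (n - 1) + 1 / n)
    return leverages
```

A useful check is that the leverages sum to the number of variables plus one (here 3), which is the trace of the hat matrix of a regression with an intercept and two predictors.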

Results

The pre-tests

The participants explained their satisfaction with the retail bank in different ways. Their

explanations are listed in Table 1 and discussed in Section 4.

Table 1: Listing of Explanations of Satisfaction with the Company

| Participant | Satisfaction | Explanation of satisfaction with the retail bank |
|---|---|---|
| 1 | Very satisfied | I feel good about [BANK]. My banking affairs are taken care of well with [BANK]. |
| 2 | Satisfied | They [BANK] do nothing wrong. There is nothing to be dissatisfied about … There is nothing to be enthusiastic about either. If [COMPETITOR] would have current accounts, I would switch immediately. |
| 3 | Satisfied | They [BANK] will not deceive you, such as [COMPETITOR X]. That was my former bank … [BANK] is easy to deal with, with limited costs. |
| 4 | Satisfied | I’ve got the impression that they [BANK] will not deceive me, and then it’s all right with me ... I’m not particularly concerned with banking affairs, my partner takes care of banking affairs … |
| 5 | Very satisfied | The staff is always friendly, and the bank is easy to deal with… I feel good about [BANK]. |
| 6 | Satisfied | I trust [BANK] … I won’t go to [COMPETITOR], to me it’s important that I can trust my bank. |
| 7 | Satisfied | It [BANK] is a friendly bank … They [BANK] are accessible … There is nothing to be dissatisfied about. |
| 8 | Satisfied | They are friendly and they are accessible … Although a relative once had an annoying incident with [BANK]. Her card was stolen and was used abroad. First, they [BANK] refused to compensate. This is not what I expected from [BANK]. |
| 9 | Satisfied | It is all right, it never goes wrong … I don’t care much about banking affairs … I don’t have any referents. |
| 10 | Moderately satisfied | In general it is all right, but last year I had an incident with [BANK]. It was about the costs of banking services. They charge basic services, while they make enormous profits with our money. |

The pilot study

Histograms (not shown here) demonstrated that the polytomous items reflecting customer

satisfaction (two item sets), trust, quality (two item sets), customer loyalty, and interest were

single peaked, and mostly negatively skewed. For example, most participants responded

positively to items that were indicative of satisfaction and negatively to items that were

counter-indicative of satisfaction (Table 2). This corresponds with the findings in other

satisfaction studies (e.g., Oliver, 1997; Peterson & Wilson, 1992). The histograms also

revealed a small group of outliers on the items adopted from the ACSI. Because these items

were not used in subsequent analyses of the pilot data, no further actions were undertaken

with respect to these outliers.

Table 2: Descriptive Statistics of Items Reflecting Customer Satisfaction (Before Imputation; N = 372)

| Code | Item | Nmiss | Mean | SD | Skewness |
|---|---|---|---|---|---|
| Q3a | At BANK I feel at home | 2 | 2.94 | 0.73 | -0.72 ** |
| Q3b | I am satisfied with BANK | 1 | 2.96 | 0.69 | -0.99 ** |
| Q3d* | There are good reasons to leave BANK | 0 | 2.99 | 0.94 | -1.13 ** |
| Q3e* | I have mixed feelings about BANK | 2 | 2.72 | 0.95 | -0.85 ** |
| Q3g | BANK meets all my requirements for a bank | 0 | 2.67 | 0.83 | -0.61 ** |
| Q4a | Last year I had a pleasant relationship with BANK | 0 | 2.88 | 0.77 | -1.36 ** |
| Q4b | BANK has met my expectations | 0 | 2.85 | 0.73 | -0.95 ** |
| Q4c* | I have regretted my choice for BANK | 0 | 3.27 | 0.71 | -1.20 ** |
| Q4d* | Last year I had some problems with BANK | 1 | 2.85 | 1.03 | -0.87 ** |

* = reverse scored, ** = p < 0.001

The descriptive statistics demonstrated a low incidence of missing values (i.e., smaller

than five percent) on the items reflecting customer satisfaction, trust, customer loyalty, and

interest, and a higher incidence of missing values on some items reflecting quality. This latter

result was probably due to items mentioning topics that were irrelevant to particular

participants; also, see Chapter 5. For the items constituting the measurement instrument for

customer satisfaction, Table 2 shows that there were few missing item scores; thus method

TW-E was used for imputing values for the missing item scores. The descriptive statistics

(i.e., mean, standard deviation, and skewness) for the items before imputation were almost

identical to the descriptive statistics for the items after imputation. This result supports the use

of method TW-E, and the items after imputation were used for subsequent analyses (i.e., the

analyses for the test of the hypotheses 12 and 13; see Chapter 4).

Table 3 shows the correlations between two items reflecting satisfaction (Table 1 in

Chapter 5) and two items reflecting interest in banking matters (Table 7 in Chapter 5). In

agreement with our expectations, (a) the satisfaction items correlated highly, (b) the interest

items correlated highly, and (c) the satisfaction items and the interest items were almost

uncorrelated. These results strengthened our confidence in the quality of the data.

Table 3: Correlations Between Two Items (Q3a and Q3b) Reflecting Customer Satisfaction and Two Items (Q17 and Q18) Reflecting Interest

| Item | Code | Q3a | Q3b | Q17 | Q18 |
|---|---|---|---|---|---|
| At BANK I feel at home | Q3a | | 0.72 | 0.08 | 0.03 |
| I am satisfied with BANK | Q3b | | | 0.01 | -0.04 |
| How interested are you in banking matters? | Q17 | | | | 0.65 |
| How interested are you in the development of new products and services by banks? | Q18 | | | | |

The main study

The results from the preliminary analyses of the data from the main study were similar to the

results from the analyses of the pilot data. Histograms (not shown here) demonstrated that all

polytomous items reflecting customer satisfaction, trust, quality, customer loyalty, and

interest were single peaked, and mostly negatively skewed; see Table 4.

For the variables reflecting customer profitability (CP) in September 2005, September

2006, and September 2007, histograms (not shown here) showed single peaked and positively

skewed distributions. Forty-three participants had a standardised CP in September 2005,

2006, or 2007 that was larger than 3. These participants were outliers, but because they had

correct values for CP (i.e., not incorrect values due to, e.g., clerical errors), they were retained

for the data analyses.

Table 4: Descriptive Statistics of Polytomous Items Reflecting Customer Satisfaction, Trust, Customer Loyalty and Interest (Before Imputation; N = 1689)

| Code | Item | Nmiss | Mean | SD | Skewness |
|---|---|---|---|---|---|
| | Customer satisfaction items | | | | |
| Q3a | At BANK I feel at home | 8 | 2.90 | 0.73 | -0.73 ** |
| Q3b | I am satisfied with BANK | 2 | 2.92 | 0.71 | -1.21 ** |
| Q3d* | There are good reasons to leave BANK | 25 | 3.01 | 0.95 | -0.98 ** |
| Q3e* | I have mixed feelings about BANK | 19 | 2.72 | 0.97 | -0.60 ** |
| Q3g | BANK meets all my requirements for a bank | 2 | 2.62 | 0.87 | -0.65 ** |
| Q4a | Last year I had a pleasant relationship with BANK | 10 | 2.82 | 0.71 | -0.71 ** |
| Q4b | BANK has met my expectations | 5 | 2.77 | 0.75 | -1.00 ** |
| Q4c* | I have regretted my choice for BANK | 14 | 3.21 | 0.76 | -1.05 ** |
| Q4d* | Last year I had some problems with BANK | 14 | 2.99 | 0.94 | -1.03 ** |
| | Trust items | | | | |
| Q5a | I can depend on BANK to treat me fairly | 9 | 2.83 | 0.66 | -1.00 ** |
| Q5b | I can depend on BANK to handle my banking aff. corr. | 4 | 2.90 | 0.63 | -1.12 ** |
| Q5c | I can depend on BANK to keep its promises | 16 | 2.77 | 0.70 | -1.04 ** |
| Q5d* | I sometimes doubt the competence of BANK | 20 | 2.78 | 0.87 | -0.72 ** |
| Q5e* | I sometimes doubt the good will of BANK | 24 | 2.80 | 0.88 | -0.80 ** |
| Q5f | I can trust BANK | 4 | 2.89 | 0.64 | -0.75 ** |
| Q5g | I can depend on BANK to serve me well | 6 | 2.75 | 0.71 | -0.84 ** |
| | Customer loyalty items | | | | |
| Q14a | If I need new fin. products, BANK is my first choice | 26 | 2.44 | 0.95 | -0.41 ** |
| Q14b | I have more sympathy for BANK than for other banks | 34 | 2.47 | 0.90 | -0.41 ** |
| Q14c* | For some matters I am better off with another bank | 124 | 1.83 | 1.01 | 0.30 ** |
| Q14d* | I consider switching from BANK to another bank | 36 | 3.01 | 0.91 | -0.95 ** |
| Q14e | BANK offers me benefits other banks don’t offer | 97 | 2.18 | 0.82 | -0.04 |
| Q14f | For many years BANK has been my primary bank | 9 | 2.96 | 0.99 | -1.08 ** |
| | Interest items | | | | |
| Q17 | How interested are you in banking matters? | 22 | 2.70 | 1.03 | -0.47 ** |
| Q18 | How interested are you in dev. of new p&s by banks? | 34 | 2.27 | 1.11 | -0.22 ** |
| | ACSI items | | | | |
| Q20b | How satisfied are you with BANK? | 7 | 6.56 | 1.30 | -1.13 ** |
| Q20c | To what extent does BANK meet your ideal of a bank? | 48 | 6.11 | 1.44 | -1.08 ** |
| Q20d | To what extent has BANK met your expectations? | 19 | 6.54 | 1.37 | -1.23 ** |

* = reverse scored, ** = p < 0.001

Outliers are common in financial data. In the financial services industry, customer

profits (i.e., CP according to the gross CP conception; Chapter 3) often follow a Pareto-like

distribution (i.e., 20% of the customers are responsible for 80% of the company’s profits). To

reduce the skewness of the distribution and the influence of the outliers on subsequent

analyses, we applied a logarithmic transformation to CPt (Jack, 1967; Tabachnick & Fidell,

2007, pp. 87-89). Let CPt denote CP at time t, TCPt transformed CPt, and ln the natural

logarithm. Because the minimum value of CPt was zero euro and the logarithm of zero is

undefined, we applied the following transformation:

TCPt = ln(CPt + 1).
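The logarithmic transformation of customer profitability, TCPt = ln(CPt + 1), can be sketched as follows. The function name is hypothetical, and the check for negative values reflects the statement that the minimum of CPt was zero euro.

```python
import math


def transform_cp(cp):
    """Return TCP = ln(CP + 1) for a non-negative customer profitability CP."""
    if cp < 0:
        raise ValueError("CP is assumed to be non-negative")
    return math.log1p(cp)  # log1p(x) computes ln(1 + x)
```

A CP of zero euro is mapped to a TCP of zero, and large values of CP are compressed, which reduces the positive skewness of the distribution.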

Table 5 shows the correlations between two items reflecting satisfaction (Table 1 in

Chapter 5) and two items reflecting interest in banking matters (Table 7 in Chapter 5). In

agreement with our expectations, (a) the satisfaction items correlated highly, (b) the interest

items correlated highly, and (c) the satisfaction items and the interest items were almost

uncorrelated. These results strengthened our confidence in the quality of the data.

The items reflecting customer satisfaction (including the items from the ACSI), trust,

customer loyalty, and interest had few missing item scores (i.e., 5% or less; see Table 4), so that

item-score imputation could be used safely. An exception was made for the items with respect

to customer loyalty; this is discussed shortly. The descriptive statistics of the items before

imputation were almost identical to the descriptive statistics of the items after imputation.

Some participants left more than 50 percent of the items reflecting satisfaction, trust, or

interest unanswered (Table 6). These participants were considered outliers, and indicator

variables identified them in the dataset.
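The indicator variable for participants who left more than half of the items of a measurement instrument unanswered can be sketched as follows. The function name and the use of None for a missing score are assumptions of the illustration.

```python
def flag_mostly_missing(responses, threshold=0.5):
    """Return 1 per participant if the proportion of missing scores exceeds threshold."""
    flags = []
    for row in responses:
        n_missing = sum(1 for score in row if score is None)
        flags.append(1 if n_missing / len(row) > threshold else 0)
    return flags
```

Note that a participant who left exactly half of the items unanswered is not flagged, because the proportion has to exceed 0.5.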

Table 4 demonstrates substantial percentages of missing scores on two items reflecting

customer loyalty, which are the items Q14c (For some matters I am better off with another

bank; Nmiss = 6 percent) and Q14e (BANK offers me benefits other banks don’t offer; Nmiss =

7 percent). The meaning of item Q14c was probably too vague, because the phrase some

matters is ambiguous and imprecise, probably referring to a variety of products and services

that are provided by retail banks. The meaning of item Q14e also was probably too vague,

because the phrase offers me benefits was not articulated. Thus, it is unclear whether this

phrase refers to characteristics of the company, such as the location of a bank office or the

availability of Internet banking facilities, or to financial offers by the company, such as a

personalised interest rate. The unfortunate phrasing of these two items in combination with

the circumstance that these items were dispensable for the study led us to delete the items

from the dataset, even though the percentages of missing item scores were smaller than 15.

The missing data on the remainder of the items reflecting customer loyalty (Table 4) were

imputed using method TW-E. Some participants left more than 50 percent of the items

reflecting customer loyalty unanswered (Table 6). These participants were considered

outliers, and we created an indicator variable to identify them in the dataset.

Table 5: Correlations Between Two Items (Q3a and Q3b) Reflecting Customer Satisfaction and Two Items (Q17 and Q18) Reflecting Interest

| Item | Code | Q3a | Q3b | Q17 | Q18 |
|---|---|---|---|---|---|
| At BANK I feel at home | Q3a | | 0.64 | 0.05 | 0.04 |
| I am satisfied with BANK | Q3b | | | -0.01 | -0.02 |
| How interested are you in banking matters? | Q17 | | | | 0.62 |
| How interested are you in the development of new products and services by banks? | Q18 | | | | |

Table 6: Number of Participants Leaving More Than Half of the Items Unanswered

| | Satisfaction | ACSI | Trust | Loyalty | Interest |
|---|---|---|---|---|---|
| N | 1 | 14 | 6 | 5 | 11 |

Histograms (not shown here) demonstrated that all polytomous items reflecting quality

were single peaked and mostly negatively skewed; see Table 7. The polytomous items

reflecting quality had many missing item scores, part of which may be due to items being

non-applicable for the participants involved (also, see Chapter 5). For example, a missing

score on an item concerning the quality of complaint-handling by the company (i.e., item

Q10f; Table 7) might indicate that the participant never had any complaints with the

company. Similarly, a missing score on an item concerning telephone service by the company

(i.e., item Q9a; Table 7), might indicate that the participant never phoned the company.

Because imputation of values for missing scores on such items would be meaningless, we

decided to exclude persons with missing scores on the polytomous items reflecting quality

from analyses of the data about quality. In general, the regular users of the BANK have a

greater chance of running into problems with transactions and services than the low-frequency

users. Thus, it is likely that the latter group is overrepresented in the missing scores on the

quality items.

In order to detect multivariate outliers, the leverage statistic was computed by means of

a regression analysis using customer-id as the criterion variable, and as the predictor variables

25 items reflecting customer satisfaction (Table 4), trust (Table 4), customer loyalty (Table 4;

the items Q14c and Q14e were excluded), interest (Table 4), and the items from the ACSI

(Table 4) (see Tabachnick & Fidell, 2007, pp. 75-76, 111-112). The analysis yielded 119

participants with a significant (p < 0.001) leverage value. Visual inspection of the data

demonstrated that these participants tended to give extremely positive or extremely negative

responses. Furthermore, the inspection demonstrated that the eight participants with the

highest leverage value alternated extremely positive and extremely negative responses to

different items having similar content. For example, a participant responded extremely

positively to one half of the items reflecting satisfaction with the company (i.e., items Q3a,

Q3b, Q3d, Q3e, and Q3g; Table 4) and extremely negatively to the other half of the items

reflecting satisfaction with the company (i.e., items Q4a, Q4b, Q4c, and Q4d; Table 4).

Another example is a participant who answered extremely positively to one item from the ACSI

(i.e., item Q20b; Table 4) and extremely negatively to the other items from the ACSI (i.e., items

Q20c and Q20d; Table 4). A third example is a participant who answered extremely negatively

to all items reflecting satisfaction with the bank (customer satisfaction items; Table 4) and

extremely positively to the items from the ACSI (ACSI items; Table 4).

It was suspected that the eight participants with the highest leverage value had

responded inconsistently to the survey items. An indicator variable was created to identify

them in the dataset. This variable was joined with the variables marking the participants who

left the majority of items reflecting a particular construct unanswered (see Table 6). The union

of these variables identified 39 outliers in the dataset. These 39 outliers were excluded from

some analyses (the dataset including the 39 outliers was labeled the complete dataset, and the

dataset without the 39 outliers was labeled the reduced dataset).

The weights needed to achieve proportional representation with respect to customer

segment were computed on the basis of the distributions of customer segment in the company

population and in the sample (see Chapter 5). Subsequently, the weights were recorded in a

variable called the weighting factor (Table 8).

Table 7: Descriptive Statistics of Polytomous Items Reflecting Quality (N = 1689)

| Code | Item | Nmiss | Mean | SD | Skewness |
|---|---|---|---|---|---|
| Q7a | Correct execution of orders | 11 | 2.11 | 0.57 | -0.37 ** |
| Q7b | Speed of money transfers | 12 | 1.67 | 0.82 | -0.33 ** |
| Q7c | Speed of service delivery | 37 | 1.81 | 0.63 | -0.39 ** |
| Q7d | Adherence to promises | 162 | 1.87 | 0.61 | -0.63 ** |
| Q7e | Correct execution of banking matters | 19 | 2.04 | 0.53 | -0.31 ** |
| Q7f | Distribution of bank statements | 10 | 1.38 | 0.86 | -0.19 |
| Q8a | Costs of accounts of the company | 201 | 1.09 | 0.70 | 0.27 ** |
| Q8b | Convenience of products and services | 32 | 1.91 | 0.58 | -0.35 ** |
| Q8c | Clarity of information provided | 32 | 1.80 | 0.62 | -0.61 ** |
| Q8d | Sufficiency of information provided | 51 | 1.78 | 0.60 | -0.57 ** |
| Q8e | Costs of services of the company | 94 | 1.07 | 0.70 | 0.31 ** |
| Q8f | Interest rates of the company | 144 | 0.85 | 0.69 | 0.41 ** |
| Q9a | Service by telephone | 456 | 1.81 | 0.65 | -0.47 ** |
| Q9b | Service by the internet | 325 | 1.89 | 0.65 | -0.52 ** |
| Q9c | Service by bank offices | 288 | 1.71 | 0.77 | -0.55 ** |
| Q9d | Service by mail correspondence | 376 | 1.74 | 0.61 | -0.55 ** |
| Q9e | Accessibility of the company | 85 | 1.82 | 0.65 | -0.50 ** |
| Q9f | Facilities for internet banking | 302 | 1.91 | 0.68 | -0.56 ** |
| Q10a | Friendliness of employees | 202 | 2.02 | 0.57 | -0.40 ** |
| Q10b | Capability of employees | 250 | 1.84 | 0.60 | -0.65 ** |
| Q10c | Reliability of employees | 327 | 1.94 | 0.49 | -0.62 ** |
| Q10d | Openness for questions | 360 | 1.70 | 0.69 | -0.66 ** |
| Q10e | Responsiveness of the company | 219 | 1.97 | 0.50 | -0.49 ** |
| Q10f | Handling of complaints | 656 | 1.67 | 0.71 | -0.67 ** |

** = p < 0.001

Table 8: Distribution of Customer Segment Within the Company, the Panel and the Sample

| Customer Segment | Company | Sample | Weighting factor |
|---|---|---|---|
| Top | 30% | 61% | 30 / 61 |
| Standard | 44% | 30% | 44 / 30 |
| Development | 26% | 9% | 26 / 9 |

3 Measurement analyses

Measurement analyses aim to construct scales and to evaluate their psychometric quality. We

used Mokken’s MH model (Chapter 4) to analyse the data representing the participants’

responses to the measurement instruments used in the empirical study. The use of the MH

model yielded the measurement scales and the participants’ scale scores. All measurement

analyses were done on the basis of the data from the main study.

The scales of customer satisfaction, customer satisfaction on the basis of the ACSI, trust,

and customer loyalty were constructed using the MH model. For this purpose, the software

program MSPwin5.0 was used (Molenaar & Sijtsma, 2000). Because it was hypothesised that

each set of items reflecting a construct constituted a unidimensional scale, the confirmatory

search strategy of Mokken scale analysis (Chapter 4) was used.

For the analysis of the item scores reflecting quality, both Mokken scale analyses and

factor analyses (Gorsuch, 1983, pp. 239-256) were used. The Mokken scale analyses were

done using MSPwin5.0, and the factor analyses were done using proc factor (SAS STAT).

Because it was expected that the items reflecting quality constituted multiple scales and we

had no hypothesis about the number of scales, we used exploratory strategies for scale

development.

Factor analysis (e.g., Bollen, 1989; Gorsuch, 1983) is a technique for investigating the

dimensionality of an item set. If the researcher has a hypothesis regarding the dimensionality

of the item set and which items load on particular factors, he or she may apply confirmatory

factor analysis to test this hypothesis (e.g., Bollen, 1989). If the researcher does not have such

a hypothesis, exploratory factor analysis (e.g., Gorsuch, 1983) may be used for investigating

the structure of the item set and the identification of common factors that account for

correlations in the item set.

Hierarchical factor analysis (Gorsuch, 1983, pp. 239-256) is a type of exploratory factor

analysis, which may be used to explore the dimensionality in a dataset if dimensions are non-

orthogonal, meaning that factors are correlated. Instead of computing loadings for often

difficult to interpret oblique factors, the correlation matrix of oblique factors is further factor-

analysed. This analysis yields one or more higher-order factors that account for the common

variance that is due to all items, and two or more orthogonalised lower-order factors that

account for the common variance that is due to clusters of items (Gorsuch, 1983, pp. 248-

Following Wirtz (2000) and Wirtz and Bateson (1995), who reported the presence of

halo effects in measurements of attribute satisfaction (Oliver, 1993), we suspected that halo

effects also could prevail in the measurement of the quality of attributes of products and

services provided by the company. These halo effects might strengthen the correlations

between all items, and cause strong correlations between factors reflecting different

dimensions of quality. Therefore we chose hierarchical factor analysis for the exploration of

the dimensionality of the data about quality. In order to explore the robustness of the results of

the factor analysis, we also applied Mokken scale analysis to the data.

Customer satisfaction was operationalised using the measurement instrument presented in

Chapter 5 (Table 1 in Chapter 5). It was hypothesised that the nine items constitute a scale

according to the MH model. To test this hypothesis, Mokken scale analysis was done using

MSPwin5.0. First, the dimensionality of the item set was investigated using the confirmatory

strategy (Section 6 from Chapter 4). Second, the assumption of monotonicity was investigated

(Section 6 from Chapter 4). Third, the scale-score statistics (Molenaar & Sijtsma, 2000, pp.

60-61) were evaluated. Fourth, the scalability of the item set within distinct customer

segments, gender groups, and age groups was investigated. For this purpose, customer

segment, gender, and age group were defined as grouping variables (Molenaar & Sijtsma,

2000, pp. 28-29). Fifth, univariate analyses of variance were done to test whether subgroups

defined on the basis of customer segment, gender, and age differed significantly with respect

to scale scores. For this purpose, proc GLM (SAS STAT) was used. Sixth, the effect of

outliers on the results was investigated by repeating the analyses on the reduced dataset (i.e.,

the dataset without outliers, see Section 2).

The confirmatory Mokken scale analyses (item selection method = Test) demonstrated

that the nine items constituted a Mokken scale with a total-scale scalability coefficient H

equal to 0.59 and a reliability coefficient rho equal to 0.91 (Table 9). The lowest item

scalability coefficient Hi was equal to 0.50, which is well above the default lower bound for

the Hi used in exploratory analyses (i.e., lower bound Hi = 0.3). This result supported the

inclusion of all nine items in the scale, and thus the conception of customer satisfaction as a

unidimensional construct. The scale consists of items that are indicative of satisfaction and

items that are counter-indicative of satisfaction. This result supports the conception of

customer satisfaction as the bipolar opposite of customer dissatisfaction.
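To make the meaning of the total-scale coefficient H more concrete, the sketch below computes H in one common formulation: the sum of the observed inter-item covariances divided by the sum of the maximum covariances that are possible given the marginal score distributions of the items. It is an illustration and not the MSPwin5.0 implementation. The function names are hypothetical, and items without variance are assumed to be absent.

```python
from itertools import combinations


def covariance(x, y):
    """Covariance of two equally long score lists (divisor n)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


def scalability_h(items):
    """items: one list of scores per item, with persons in the same order."""
    observed = 0.0
    maximum = 0.0
    for x, y in combinations(items, 2):
        observed += covariance(x, y)
        # Given the marginals, the covariance is largest when the sorted
        # scores of both items are paired with each other.
        maximum += covariance(sorted(x), sorted(y))
    return observed / maximum
```

For item scores that form a perfect Guttman pattern the function returns 1, and for unrelated items it returns a value close to 0.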

The check for item monotonicity on the basis of the default options in MSPwin5.0 (i.e.,

Minvi = 0.03 and Minsize = 168, which is 10 percent of the sample) did not reveal any


violations of the assumption of monotonicity. This means that the ISRFs of all items were

nondecreasing for all rest-score groups. However, the check for item monotonicity on the

basis of smaller rest-score groups (i.e., Minsize = 84, which is 5 percent of the sample)

yielded two significant violations of the assumption of monotonicity. These violations were

due to small decreases in the estimated ISRF for Q3d >= 2 (There are good reasons to leave

BANK; Table 4) (Figure 1) and the estimated ISRF for Q4c >= 4 (I have regretted my choice

for BANK; Table 4) (Figure 2). Thus, the MH model did not fit the data perfectly.
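The logic of the monotonicity check can be sketched in simplified form. For one item step, persons are grouped by their rest score (the sum score on the other items), adjacent rest scores are merged until each group contains at least Minsize persons, and the proportion of persons passing the step is compared between groups. The grouping rule below is a simplification and the significance test that MSPwin5.0 applies to the violations is omitted; the function and argument names are hypothetical.

```python
def monotonicity_violations(scores, item, step, minsize, minvi=0.03):
    """Return (lower_group, higher_group, decrease) for decreases larger than minvi.

    scores: list of persons, each a list of complete item scores.
    """
    # Rest score and whether the person passed the item step (X_item >= step)
    persons = sorted((sum(row) - row[item], row[item] >= step) for row in scores)

    # Merge adjacent rest scores until each group holds at least minsize persons
    groups, current, last_rest = [], [], None
    for rest, passed in persons:
        if len(current) >= minsize and rest != last_rest:
            groups.append(current)
            current = []
        current.append(passed)
        last_rest = rest
    if groups and len(current) < minsize:
        groups[-1].extend(current)  # fold a too-small last group into its neighbour
    else:
        groups.append(current)

    proportions = [sum(group) / len(group) for group in groups]
    violations = []
    for low in range(len(proportions)):
        for high in range(low + 1, len(proportions)):
            decrease = proportions[low] - proportions[high]
            if decrease > minvi:
                violations.append((low, high, decrease))
    return violations
```

A smaller Minsize yields more and smaller rest-score groups, so that local decreases in the estimated ISRF become visible that are averaged out in larger groups.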

The psychometric properties of the scale were slightly improved if item Q3d was

removed from the scale. The 8-item scale yielded a total-scale scalability coefficient H equal

to 0.59 without significant violations of the assumption of monotonicity, a result that was also

found when the assumption was tested on the basis of small rest-score groups (i.e., Minsize =

84). However, it is doubtful whether the 8-item scale yielded better measurements of

satisfaction, because each item in the scale is important for sufficient content validity (i.e.,

equal coverage of all aspects of customer satisfaction in the scale). We decided to proceed

with the 9-item scale because the violations of monotonicity in the 9-item scale were small,

and the 9-item scale had the best content validity.

Figure 1: Item step response functions of item Q3d: There are good reasons to leave BANK

Figure 2: Item step response functions of item Q4c: I have regretted my choice for BANK

The customer satisfaction scale-score distribution is presented in Figure 3. It may be

noted that the distribution of scale scores was significantly skewed to the left (p < 0.001), and

that there were outliers in the skew tail. The negative skewness is a common result in

customer satisfaction measurements (Peterson & Wilson, 1992). It is unknown whether the

outliers were caused by extreme dissatisfaction of the corresponding participants with the

company or by stylistic responding. Stylistic responding is investigated in Chapter 8.
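The order of magnitude of the skewness test can be illustrated with a short calculation. The test that was actually used is not specified here; the sketch assumes the common large-sample approximation in which the standard error of the skewness equals the square root of 6/N, and it uses the skewness and sample size reported in Figure 3. The function name is hypothetical.

```python
import math


def skewness_z_test(skewness, n):
    """Large-sample z test for skewness, assuming SE = sqrt(6 / n)."""
    standard_error = math.sqrt(6.0 / n)
    z = skewness / standard_error
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Values reported for the complete dataset (Figure 3).
z, p = skewness_z_test(-0.85, 1689)
print(f"z = {z:.2f}, p < 0.001: {p < 0.001}")
```

Under this assumption, the reported skewness lies many standard errors below zero, which is in line with the reported significance level.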

The Mokken scale analyses using the grouping variables customer segment (valued Top

Customers, Standard Customers, and Development Customers; see Chapter 5), gender (valued

female, male, and missing), and age (valued 18 to 39 years, 40 years to 59 years, and 60 years

onwards; see Chapter 5) demonstrated that the nine items constituted a strong Mokken scale

(i.e., H > 0.5) in each subgroup (Table 9). The checks for item monotonicity did not yield

significant violations of the assumption of monotonicity, a result that was also found for

smaller rest-score groups (i.e., Minsize = 84). For this reason, it was concluded that the 9-item

scale may be used to measure customer satisfaction in different subgroups of the target

population.
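To make the criterion H > 0.5 concrete, the total-scale scalability coefficient can be sketched as the ratio of the summed inter-item covariances to the summed maximum covariances that are possible given the marginal distributions of the items. This is a minimal illustration with hypothetical function names, not the MSPwin5.0 implementation; the maximum covariance of an item pair is obtained here by sorting both score vectors.

```python
from itertools import combinations


def _covariance(x, y):
    """Population covariance of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def scalability_h(items):
    """Total-scale H for a list of item score lists (one list per item).

    Assumes that at least one item pair has a positive maximum covariance.
    """
    observed = 0.0
    maximum = 0.0
    for x, y in combinations(items, 2):
        observed += _covariance(x, y)
        # Sorting both items gives the largest covariance the marginals allow.
        maximum += _covariance(sorted(x), sorted(y))
    return observed / maximum


# A perfect Guttman pattern yields H = 1.
print(scalability_h([[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]))
```

Applying such a function to the item scores of each subgroup separately would give subgroup coefficients of the kind reported in Table 9.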


Figure 3: Distribution of customer satisfaction scores in the complete dataset (N = 1689, mean = 25.96, SD = 5.57, and skewness = -0.85)

Table 10 shows that customer segments differed significantly with respect to scale

score (based on analysis of variance). This result is consistent with results from previous

satisfaction studies done by the company (e.g., Terpstra, 2005), and it suggests that the three

customer segments differ with respect to the average satisfaction with the company. The

result also supports the pursuit of proportional representation of customer segments in

descriptive studies of customer satisfaction. Furthermore, gender groups did not differ

significantly with respect to scale score (Table 10). Age groups differed significantly with

respect to scale score (Table 10). The latter result was unexpected, but because the magnitude

of the differences between the age groups was small, we considered it unimportant in the

context of the present study.

The analyses of the reduced dataset yielded similar results as the analyses of the

complete dataset. The confirmatory Mokken scale analyses (item selection method = Test)

yielded a scale with a total-scale scalability coefficient H equal to 0.60 and a reliability

coefficient rho equal to 0.91 (Table 11). The check for item monotonicity on the basis of the

default options (i.e., Minvi = 0.03 and Minsize = 165, which is 10 percent of the sample) did

not reveal violations of the assumption of monotonicity. The same result was found for the

check for item monotonicity with smaller rest-score groups (i.e., Minsize = 83). Thus, the MH

model fitted the data in the reduced dataset.

The Mokken scale analyses using the grouping variables customer segment, gender, and

age yielded a strong Mokken scale (i.e., H > 0.5) in each subgroup (Table 11). The checks for

item monotonicity on the basis of the default options (i.e., Minvi = 0.03 and Minsize = 165,

which is 10 percent of the sample) did not yield significant violations of the assumption of

monotonicity in subgroups. However, the check for item monotonicity on the basis of smaller

rest-score groups (i.e., Minsize = 83) yielded a significant violation of the assumption of

monotonicity for item Q4c (Table 4) in the age group of 60 years and older. This was due to a

decrease of the estimated ISRF for Q4c >= 3 (i.e., the proportion of responses Q4c >= 3

decreased from 1.00 in the middle rest-score group to 0.96 in the highest rest-score group).

Because the magnitude of the decrease was small, we considered it not disturbing and we

concluded that the scale score is useful for the measurement of customer satisfaction in

different subgroups of the target population.

The customer satisfaction scale-score distribution (Figure 4) was significantly skewed

to the left (p < 0.001). Furthermore, univariate analyses of variance demonstrated that the

customer segments and the age groups differed significantly with respect to scale score (Table

12). Gender groups did not differ significantly.


Figure 4: Distribution of customer satisfaction scores in the reduced dataset (N = 1650, mean = 26.04, SD = 5.50, and skewness = -0.84)


American Customer Satisfaction Index

The ACSI (Table 2 in Chapter 5) was used as the second operationalisation of customer

satisfaction. The empirical data were analysed by means of Mokken scale analyses. The

analyses were done in both the complete dataset and the reduced dataset. First, the

dimensionality of the item set was investigated using the confirmatory strategy (see Chapter

4). Second, the assumption of monotonicity was tested using the default check for item

monotonicity (see Chapter 4). Third, the scale scores and the scale-score statistics were

computed.

The analyses of the complete dataset demonstrated that the three ACSI items

constituted a strong Mokken scale (Table 13). The default check for item monotonicity (i.e.,

Minvi = 0.03, Minsize = 168) did not yield violations of the assumption of monotonicity.

Thus, the MH model fitted the data. The scale-score distribution is presented in Figure 5, and

shows negative skewness, outliers in the skew tail, peaks for the scale-scores 18 and 21, and a

drop for scale-score 22. Because our major concern was the measurement of customer

satisfaction on the basis of the nine-item scale (Table 1 in Chapter 5) and we expected that the

irregularities of the ACSI score distribution would not seriously hamper the tests of the

hypotheses (Section 4), we refrained from inquiries into the causes of the irregularities of the

ACSI score distribution.

The analyses of the reduced dataset (Table 13) yielded similar results as the analyses in

the complete dataset. Thus, the MH model also fitted the data in the reduced dataset. The

scale-score distribution is presented in Figure 6. The results in the reduced dataset were

similar to the results in the complete dataset.

Table 13: ACSI’s Total-Scale Scalability Coefficients H, Item Scalability Coefficients Hi, and Reliability Coefficients Rho in the Complete Dataset (CD) and the Reduced Dataset (RD)

Item CD (N = 1684) RD (N = 1650)

How satisfied are you with BANK? 0.84 0.86

To what extent does BANK meet your ideal of a bank? 0.81 0.82

To what extent has BANK met your expectations? 0.82 0.83

H 0.82 0.83

Rho 0.92 0.92


Figure 5: Distribution of ACSI scores in the complete dataset (N = 1684, mean = 19.20, SD = 3.80 and skewness = -1.08)


Figure 6: Distribution of ACSI scores in the reduced dataset (N = 1650, mean = 19.22, SD = 3.77, and skewness = -1.06)

Trust

The empirical data collected by means of the trust instrument (see Chapter 5, Table 3) were

analysed by means of Mokken scale analyses. First, the dimensionality of the item set was

investigated using the confirmatory research method (Chapter 4). Second, the assumption of

monotonicity was tested on the basis of the default check for item monotonicity (Chapter 4).

Third, the scale scores and the scale-score statistics were computed.

The analyses of the complete dataset demonstrated that the seven items for trust

constituted a Mokken scale (Table 14). The default check for item monotonicity (i.e., Minvi =

0.03, Minsize = 168) yielded no violations of the assumption of monotonicity. Thus, the MH

model fitted the data. The scale-score distribution is presented in Figure 7. The distribution

was significantly skewed to the left, had outliers in the skew tail, and a large peak for scale-

score 21. Because our major concern was the measurement of customer satisfaction and we

expected that the irregularities of the trust score distribution would not seriously hamper the

tests of the hypotheses (Section 4), we refrained from inquiries into the causes of the

irregularities in the trust score distribution.

The analyses of the reduced dataset yielded similar results as the analyses of the

complete dataset (Table 14). Thus, the MH model also fitted the data in the reduced dataset.

The scale-score distribution is presented in Figure 8. The results in the reduced dataset were

similar to the results in the complete dataset.

Table 14: Trust Scale’s Total-Scale Scalability Coefficients H, Item Scalability Coefficients Hi, and Reliability Coefficients Rho in the Complete Dataset (CD) and the Reduced Dataset (RD)

Item CD (N = 1689) RD (N = 1650)

I can depend on BANK to treat me fairly 0.66 0.66

I can depend on BANK to handle my banking affairs correctly 0.69 0.69

I can depend on BANK to keep its promises 0.63 0.63

I sometimes doubt the competence of BANK * 0.57 0.58

I sometimes doubt the good will of BANK * 0.57 0.57

I can trust BANK 0.66 0.66

I can depend on BANK to serve me well 0.63 0.63

H 0.63 0.63

Rho 0.91 0.91

* = scored reversely


Figure 7: Distribution of trust scores in the complete dataset (N = 1689, mean = 19.71, SD = 4.02, and skewness = -0.71)


Figure 8: Distribution of trust scores in the reduced dataset (N = 1650, mean = 19.75, SD = 3.97, and skewness = -0.73)

Quality

Quality was operationalised using the set of 24 items measuring judgements of attributes of

products and services provided by the retail bank (Table 7). First, two items (i.e., Q10d and

Q10f; Table 7) were excluded from the analyses because of the large percentages of missing

values on these items. Second, because many item scores were missing due to

inappropriateness of item content for several participants, factor analysis and Mokken scale

analysis were done based on listwise deletion (see Section 2). The number of available

participants for the analyses of the remaining 22 items was N = 599 in the complete dataset

and N = 591 in the reduced dataset.

Factor analysis (e.g., Gorsuch, 1983) was used to establish the dimensionality of the

data set for the 22 quality items. Exploratory factor analysis was done to identify the factor

structure of the dataset, and hierarchical factor analysis was done to investigate the relations

among the factors. The results of these analyses were used to construct scales for quality.

Next, Mokken scale analysis was done to explore the robustness of the results. The

analyses were repeated in the reduced dataset.

The exploratory factor analysis with squared multiple correlations used as prior

communality estimates yielded only eleven positive eigenvalues (this is the result of inserting

estimates of the communalities in the trace of the correlation matrix; see also Tabachnick &

Fidell, 2007, p. 631), and the primary four factors explained almost 91 percent of the common

variance (Table 15). Because we expected a large number of factors, we decided to proceed

with all four factors in the hierarchical factor analysis. This decision was supported by the

simple structure (Gorsuch, 1983, pp. 176-179) of the non-orthogonally rotated (i.e., using

method promax) 4-factor solution, which was readily interpretable.
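The percentages of common variance explained can be reproduced from the eigenvalues as a short worked example. The sketch assumes that each percentage is the eigenvalue divided by the sum of the eleven positive eigenvalues; the input values are the rounded principal factor analysis eigenvalues for the complete dataset in Table 15, so the results are approximate. The function name is hypothetical.

```python
# Rounded eigenvalues from Table 15 (principal factor analysis, complete dataset).
eigenvalues = [8.44, 1.29, 0.93, 0.59, 0.40, 0.31, 0.22, 0.12, 0.08, 0.03, 0.01]


def percentage_common_variance(eigs):
    """Share of each eigenvalue in the sum of the positive eigenvalues, in percent."""
    total = sum(eigs)
    return [100.0 * e / total for e in eigs]


shares = percentage_common_variance(eigenvalues)
first_four = sum(shares[:4])
print(f"first factor: {shares[0]:.3f} percent")
print(f"first four factors: {first_four:.2f} percent")
```

With these inputs the first factor accounts for about 68 percent and the first four factors together for almost 91 percent of the common variance.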

The hierarchical factor analysis was done using an iterative procedure to estimate the

communalities, and using an oblique rotation method (i.e., method promax). The eigenvalues

are reported in Table 15, and the inter-factor correlations of the four oblique-rotated factors

were high (Table 17). The factor analysis of the correlation matrix of the oblique factors

yielded one higher-order factor. The higher-order factor reflected all quality items and

accounted for approximately 72 percent of the common variance in the items (Table 16). The

four orthogonalised lower-order factors reflected quality of contact handling, quality of

Internet facilities, quality of processes, and equity of costs and revenues, respectively, and

accounted for approximately 28 percent of the common variance in the items (Table 16).

Because the major part of the common variance was explained by the higher-order

factor, we had doubts about the dimensionality of the quality items and the interpretation of

the lower-order factors. These doubts were enhanced by exploratory Mokken scale analyses

(item selection method = Search normal, and lowerbound Hi = 0.3 were used), which yielded

a 20-item scale in the complete dataset and a 21-item scale in the reduced dataset (Table 18).

It seems that a general perception of the quality of the company affected the participants’

responses to all items regarding quality of attributes of products and services provided by the

company.

Based on these results, we suspected that a halo effect (Thorndike, 1920) had affected

the responses to the items reflecting quality. Wirtz and Bateson (1995; also Wirtz, 2000)

reported a similar result in studies into drivers of customer satisfaction. Because of the

suspected halo effect and the complications caused by the missing data on the items

reflecting quality, we decided to use the data collected by means of the set of 16 items

measuring the experience of problems with BANK in the preceding twelve months (Table 4,

Chapter 5) in the remainder of this study.

Table 15: Eigenvalues (EV) and Percentages Common Variance Explained (PCVE) from Principal Factor Analyses (PFA) and Hierarchical Factor Analyses (HFA) on the Quality-Items

Complete dataset (N = 599): PFA, HFA. Reduced dataset (N = 591): PFA, HFA.

Factor EV PCVE EV PCVE EV PCVE EV PCVE

1 8.44 67.955 8.46 67.196 8.30 67.425 8.32 66.667

2 1.29 10.386 1.35 10.723 1.27 10.317 1.33 10.657

3 0.93 7.488 0.98 7.784 0.95 7.717 1.00 8.013

4 0.59 4.750 0.61 4.845 0.58 4.712 0.61 4.888

5 0.40 3.221 0.37 2.939 0.40 3.249 0.37 2.965

6 0.31 2.496 0.30 2.383 0.31 2.518 0.30 2.404

7 0.22 1.771 0.21 1.668 0.23 1.868 0.22 1.763

8 0.12 0.966 0.13 1.033 0.13 1.056 0.13 1.042

9 0.08 0.644 0.09 0.715 0.09 0.731 0.10 0.801

10 0.03 0.242 0.06 0.477 0.04 0.325 0.06 0.481

11 0.01 0.081 0.03 0.238 0.01 0.081 0.04 0.321


Table 17: Correlations Between the Four Factors Representing Quality (Upper Half = Complete Dataset, Lower Half = Reduced Dataset)

          Factor1   Factor2   Factor3   Factor4
Factor1      -       0.68      0.63      0.41
Factor2    0.67       -        0.73      0.53
Factor3    0.62      0.72       -        0.46
Factor4    0.42      0.54      0.44       -

Table 18: Quality Scale’s Total-Scale Scalability Coefficients H, Item Scalability Coefficients Hi, and Reliability Coefficients Rho in the Complete Dataset and the Reduced Dataset

Item Complete dataset Reduced dataset

Correct execution of orders 0.47 0.45

Speed of money transfers 0.35 0.34

Speed of service delivery 0.49 0.47

Adherence to promises 0.49 0.47

Correct execution of banking matters 0.53 0.51

Distribution of bank statements * *

Costs of accounts of the company 0.38 0.40

Convenience of products and services 0.49 0.47

Clarity of information provided 0.50 0.48

Sufficiency of information provided 0.50 0.49

Costs of services of the company 0.40 0.40

Interest rates of the company * 0.30

Service by telephone 0.48 0.46

Service by the internet 0.45 0.44

Service by bank offices 0.32 0.32

Service by mail correspondence 0.49 0.47

Accessibility of the company 0.49 0.46

Facilities for internet banking 0.44 0.42

Friendliness of employees 0.43 0.41

Capability of employees 0.47 0.46

Reliability of employees 0.52 0.50

Responsiveness of the company 0.52 0.50

H 0.46 0.43

Rho 0.93 0.93

* = excluded from the scale because item scalability coefficient Hi < 0.3

The distribution of the number of problems with BANK in the preceding twelve months

is presented in Table 19. In both the complete dataset and the reduced dataset, 57 percent of

the participants mentioned the incidence of at least one problem with BANK in the preceding

twelve months.

Exploratory Mokken scale analyses (item selection method = Search normal, and

lowerbound Hi = 0.3 were used) yielded five scales of two items each, and six items that were

non-scalable. This result indicates that the responses to the items were not the result of a

unidimensional trait such as a general perception of the quality of the company. This result is

consistent with the conception of quality as a multidimensional construct.

In the remainder of this study, quality was re-defined as absence of problems. This

definition of quality is in line with the conception of quality as absence of failures (e.g.,

Garvin, 1983; Kackar, 1989, p. 6; Woodall, 2001; see also Chapter 3). Because the experience

of a problem is counter-indicative of quality, the items reflecting experience of problems were

recoded into the opposite direction (Section 2). Quality was then operationalised as the total

score on the 16 recoded items regarding the incidence of problems with BANK in the

preceding twelve months (Table 19). The quality score (i.e., total score) ranged from 0 (if the

participant had 16 problems with BANK in the preceding 12 months) to 16 (if the participant

had 0 problems with BANK in the preceding 12 months).
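The recoding and summation can be sketched as follows. The sketch assumes that each of the 16 problem items is stored as 1 if the problem was experienced and as 0 otherwise; this coding and the function name are illustrative assumptions.

```python
N_PROBLEM_ITEMS = 16


def quality_score(problem_items):
    """Sum of the reverse-coded problem items (1 = problem experienced, 0 = not)."""
    if len(problem_items) != N_PROBLEM_ITEMS:
        raise ValueError("expected 16 problem items")
    # Reverse coding: a reported problem counts as 0, absence of a problem as 1.
    recoded = [1 - score for score in problem_items]
    return sum(recoded)


print(quality_score([0] * 16))             # no problems
print(quality_score([1, 1] + [0] * 14))    # two problems
```

Under this coding, the quality score equals 16 minus the number of problems.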

The distribution of the quality scores was negatively skewed (Table 19). This may

hamper the tests of the hypothesis 5 (i.e., satisfaction scores are positively related to quality

scores) and hypothesis 9 (i.e., satisfaction scores are not contaminated by quality). Following

a suggestion of Tabachnick and Fidell (2007, pp. 87-89) to reflect negatively skewed

variables and transform the reflected variables, we applied a logarithmic transformation to the

variable number of problems. Let NP denote the number of problems, TNP transformed NP,

and ln the natural logarithm. Because the minimum value for NP was zero, we applied the

following transformation:

TNP = ln(NP + 1).

The hypotheses 5 and 9 were tested once using the quality scores, and once using TNP.

Table 19: Distribution of the Number of Problems (NP), Transformed Number of Problems (TNP), and Quality Scores in the Complete Dataset and the Reduced Dataset

NP      TNP       Quality Score   Percentage in Complete Dataset (N=1689)   Percentage in Reduced Dataset (N=1650)
0       0.69      16              43                                        43
1       1.10      15              25                                        25
2       1.39      14              16                                        16
3       1.61      13              8                                         8
4       1.79      12              4                                         4
5       1.95      11              2                                         2
6       2.08      10              1                                         1
>= 7    >=2.20    <=9             1                                         1

Customer loyalty

Two items (i.e., Q14c and Q14e; Table 4) were deleted from the customer loyalty item set

(Table 6, Chapter 5) due to unfortunate item wording (Section 2). Mokken scale analyses

were done to investigate whether the MH model fitted the data from the remaining four items.

The analyses were done in both the complete dataset and the reduced dataset. First, the

dimensionality of the data was investigated using the confirmatory research method (Chapter

4). Second, the assumption of monotonicity was tested on the basis of the default check for

item monotonicity (Chapter 4). Third, the scale scores and the scale-score statistics were

computed.

The analyses of the complete dataset yielded a total-scale scalability coefficient H

equal to 0.54 and a reliability coefficient rho equal to 0.80. However, the default check for

item monotonicity (i.e., Minvi = 0.03 and Minsize = 168) revealed significant violations of the

assumption of monotonicity. The estimated ISRF for Q14d >= 3 (I consider switching from

BANK to another bank; Table 4) decreased at the end of the rest-score scale, and the estimated

ISRF for Q14d>= 4 decreased at the beginning of the scale (Figure 9). The checks for item

monotonicity with smaller rest-score groups (i.e., Minsize = 84) revealed that the estimated

ISRF for Q14d >= 3 also decreased at the beginning of the scale (Figure 10). Thus, the MH

model did not fit the data. The analyses in the reduced dataset corroborated the results found

in the complete dataset. Thus, the MH model also did not fit the data in the reduced dataset.

Figure 9: Item step response functions of item Q14d: I consider switching from BANK to another bank (Minsize = 168)

Figure 10: Item step response functions of item Q14d: I consider switching from BANK to another bank (Minsize=84)

Because the violations of monotonicity were substantial, we decided to repeat the

measurement analyses without item Q14d (Table 4). The analyses of the complete dataset

yielded a total-scale scalability coefficient H of 0.64 and a reliability coefficient rho of 0.82

(Table 20). The default item-check for monotonicity (i.e., Minvi = 0.03 and Minsize = 168)

did not reveal violations of the assumption of monotonicity. This result was also found with

smaller rest-score groups (i.e., Minsize = 84). Thus, the MH model fitted the data for the three

items in the complete dataset. The analyses of the reduced dataset yielded similar results as

the analyses of the complete dataset (Table 20). Thus, the MH model also fitted the three

items in the reduced dataset.

The content validity of the 3-item scale was considered sufficient because the three

items reflected the three aspects of customer loyalty (see Table 6 in Chapter 5). Because of

sufficient coverage of customer loyalty and because the 3-item scale met the requirements of

the MH-model, we decided to use the 3-item scale to measure customer loyalty in all

subsequent analyses. The corresponding scale-score distributions are presented in Figure 11

(complete dataset) and Figure 12 (reduced dataset). The scale-score distributions were skewed

to the left. We refrained from inquiries into the cause of the skewness, because we expected

that the skewness would not seriously hamper the test of the hypotheses (Section 4).

Table 20: Customer Loyalty Scale’s Total-Scale Scalability Coefficients H, Item Scalability Coefficients Hi, and Reliability Coefficients Rho in the Complete Dataset (CD) and the Reduced Dataset (RD)

Item CD (N = 1686) RD (N = 1650)

If I need new financial products, BANK is my first choice 0.65 0.64

I have more sympathy for BANK than for other banks 0.63 0.63

For many years BANK has been my primary bank 0.65 0.64

H 0.64 0.64

Rho 0.82 0.81


Figure 11: Distribution of customer loyalty scores in the complete dataset (N = 1686, mean = 7.87, SD = 2.42, and skewness = -0.62)


Figure 12: Distribution of customer loyalty scores in the reduced dataset (N = 1650, mean = 7.89, SD = 2.40, and skewness = -0.62)

4 Tests of the hypotheses

In this section, the tests of the hypotheses (see Chapter 4) are discussed. Successively, we

tested the hypotheses regarding explicit construct representation, concept-related irrelevant

variance, method-related irrelevant variance, and implicit construct representation. The

purpose of these tests was to collect empirical evidence with respect to the validity of the

Explicit construct representation

Hypothesis 1 was: customer satisfaction is manifested in various expressions that are

mutually related but not sharply delineated. The hypothesis was tested by means of an

examination of the verbal explanations of satisfaction given by the participants to the pre-

tests. The pre-tests data demonstrated that participants attached diverse meanings to the term

satisfaction (see Table 1). When asked to explain their satisfaction with the company in their

own words, participants answered in terms of (a) general affect, (b) friendliness, (c) past

performances, (d) qualities of the company, (e) absence of dissatisfaction, and (f) trust in the

company. With respect to the last result, some participants answered ‘I trust the company’,

‘The company will not deceive me, such as … did ’, or ‘I don’t think they deceive me’. These

answers indicate that overall satisfaction with a particular retail bank and trust of the bank are

strongly interrelated. The results support the hypothesis that satisfaction is manifested in

various expressions that are mutually related but not sharply delineated.

Hypothesis 2 was: the satisfaction items constitute a scale according to the MH model.

The hypothesis was supported by the results of the measurement analyses (see Section 3),

which demonstrated that the items constituted a strong MH model scale in the whole sample

as well as in all subgroups investigated.

Hypothesis 3 was: the satisfaction measure is positively correlated to other measures of

satisfaction. The hypothesis was tested by means of correlation analyses between the

satisfaction measure and the ACSI. The correlation was significant (p < 0.001) in both the

complete dataset and the reduced dataset (Table 21). Thus, the hypothesis was supported.

Table 21: Product-Moment Correlations (r) Between Satisfaction and the ACSI

Complete Dataset (N = 1681)          Reduced Dataset (N = 1650)
r        95%-interval for ρ          r        95%-interval for ρ
0.78*    0.76 ≤ ρ ≤ 0.80             0.79*    0.77 ≤ ρ ≤ 0.81

* = p <0.001
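The intervals for ρ in Table 21 can be approximated with the Fisher z-transformation. The method that was used to compute the intervals is not stated here, so the sketch below is an assumption; with the reported values of r and N it yields bounds that agree with Table 21 to two decimals. The function name is hypothetical.

```python
import math


def fisher_interval(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)                      # transform r to the z scale
    standard_error = 1.0 / math.sqrt(n - 3)
    lower = math.tanh(z - z_crit * standard_error)
    upper = math.tanh(z + z_crit * standard_error)
    return lower, upper


for label, r, n in [("complete", 0.78, 1681), ("reduced", 0.79, 1650)]:
    lower, upper = fisher_interval(r, n)
    print(f"{label}: {lower:.2f} <= rho <= {upper:.2f}")
```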

Concept-related irrelevant variance

Following Oort (1996), the hypotheses regarding concept-related irrelevant variance were

tested using restricted factor analysis. Restricted factor analysis is confirmatory factor

analysis with particular restrictions on the loadings. In restricted factor analysis, a model is

specified such that the indicators of the trait load on the factor reflecting the trait, and not on

the factor reflecting the violator. Thus, the loadings of the indicators of the trait on the factor

reflecting the violator are restricted to the value 0. The loadings of the indicators reflecting

the violator on the factor reflecting the trait are also restricted to the value 0. Then, the fit of

the model is evaluated in order to determine the model’s tenability.
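The pattern of restrictions can be written out as a small sketch. It only lists which loadings are free to be estimated and which are fixed at 0; it does not estimate the model. The function name and the marker "free" are illustrative assumptions.

```python
FREE = "free"  # marker for a loading that is estimated


def loading_pattern(n_trait_indicators, n_violator_indicators):
    """Rows are indicators; columns are (trait factor, violator factor).

    Each entry is either FREE or the fixed value 0.0.
    """
    pattern = []
    for _ in range(n_trait_indicators):
        pattern.append((FREE, 0.0))   # loads on the trait, not on the violator
    for _ in range(n_violator_indicators):
        pattern.append((0.0, FREE))   # loads on the violator, not on the trait
    return pattern


for row in loading_pattern(9, 1):
    print(row)
```

Item bias is then examined by asking how much the fit would improve if one of the loadings fixed at 0 in the second column were set free.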

Oort (1996, pp. 46-49) suggested using the modification indices (MI’s) or adjusted

modification indices (AMI’s; to be discussed later) to detect item bias (i.e., whether particular

indicators reflecting the trait are biased with respect to a violator). The MI is a statistic which

reveals how much the fit of the model will improve if the factor loading of an indicator I of

trait T on violator V is set free to be estimated. The MI is approximately chi-squared

distributed with one degree of freedom (Bollen, 1989, p. 299). If the MI’s reveal that the fit of

the model will improve significantly by allowing a particular indicator I to load on violator V,

this means that indicator I is biased with respect to violator V, and that the measurement of

trait T is contaminated with respect to violator V. If the MI’s reveal that the fit of the model

will not be improved significantly by allowing any of the indicators I to load on violator V,

this means that none of the indicators I is biased with respect to violator V, and that the

measurement of trait T is not contaminated with respect to violator V.

A larger number of significance tests and a larger sample size increase the likelihood of

finding significant MI’s and of obtaining false positives. In order to reduce the risk of false

positives, Oort (1996, p. 49) suggested using AMI’s to detect biased items. The AMI is a

statistic, which reduces the power of the MI, and thus is useful for the detection of items that

are substantially biased. The AMI is defined as:

AMI = ((df – 1) / (χ2 – MI)) * MI,

where χ2 is the chi-squared value and df is the degrees of freedom under the null model (i.e.,

the restricted model). If the AMI exceeds a critical chi-squared value with one degree of

freedom, such as the critical chi-squared value for the 5 percent level of significance (i.e., χ2 =

3.84), the item is judged to be biased.
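The AMI calculation and the decision rule can be illustrated with a short sketch. The function names are hypothetical, and the input values in the example are invented for illustration; they are not results from this study. The p-value is obtained from the chi-squared distribution with one degree of freedom.

```python
import math

CRITICAL_CHI2_1DF_05 = 3.84  # critical value at the 5 percent level, 1 df


def adjusted_modification_index(mi, chi2, df):
    """AMI = ((df - 1) / (chi2 - MI)) * MI, with chi2 and df of the restricted model."""
    return (df - 1) / (chi2 - mi) * mi


def chi2_1df_p_value(value):
    """Upper-tail probability of a chi-squared variable with one degree of freedom."""
    return math.erfc(math.sqrt(value / 2.0))


def is_biased(ami, critical=CRITICAL_CHI2_1DF_05):
    """An item is judged to be biased if its AMI exceeds the critical value."""
    return ami > critical


# Hypothetical input: MI = 10 under a restricted model with chi2 = 110 and df = 35.
ami = adjusted_modification_index(mi=10.0, chi2=110.0, df=35)
print(f"AMI = {ami:.2f}, biased: {is_biased(ami)}, p = {chi2_1df_p_value(ami):.3f}")
```

In this hypothetical example the MI of 10 would be significant on its own, whereas the AMI stays below 3.84, which shows how the adjustment reduces the power of the test.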

In this study, restricted factor analysis was performed using proc calis (SAS STAT).

Thus, a model was specified in which the nine items reflecting satisfaction loaded on a factor

reflecting satisfaction, and not on a factor reflecting the violator under investigation (Figure

13). The indicator of the violator loaded on the factor reflecting the violator and not on the

factor reflecting satisfaction. Because only one indicator loaded on the factor reflecting the

violator, no error term was specified for that indicator (Oort, 1996, p. 47). The AMI’s were

calculated by hand on the basis of the chi-squared value and the degrees of freedom under the

null model, and the MI’s of the nine items reflecting satisfaction. The fit of the model was

evaluated on the basis of the goodness of fit index (GFI), the normed fit index (NFI), and the

non-normed fit index (NNFI). As a rule of thumb, indices having a value of 0.90 or higher

indicate an acceptable fit (e.g., Bollen, 1989, pp. 269-281). The analyses were performed on

both the complete dataset (N = 1689) and the reduced dataset (N = 1650).


Figure 13: Graphical display of the factor model with nine indicators of customer satisfaction and one indicator of the violator.

Hypothesis 8 was: the satisfaction scores are not contaminated by trust. This

hypothesis was tested by means of a restricted factor analysis model using the nine items

reflecting satisfaction with the company, and the trust score. The factor model was specified

such that the nine items reflecting satisfaction loaded on the factor reflecting satisfaction, and

the trust score loaded on the factor reflecting the violator. The factor model fitted the data

well (i.e., GFI = 0.92; NFI = 0.93; NNFI = 0.92), and none of the AMI’s was significant

(Table 22; the complete dataset). Similar results were found in the reduced dataset (Table 22;

the reduced dataset). Thus, none of the items reflecting satisfaction was significantly biased

with respect to trust, and the hypothesis was supported.

Table 22: AMI’s of the Satisfaction Items in the Complete Dataset and the Reduced Dataset

Complete dataset (N=1689): AMI, p-value. Reduced dataset (N=1650): AMI, p-value.

Item AMI p-value AMI p-value

At BANK I feel at home 0.02 ns 0.00 ns

I am satisfied with BANK 0.66 ns 0.67 ns

There are good reasons to leave BANK * 0.18 ns 0.45 ns

I have mixed feelings about BANK * 0.11 ns 0.02 ns

BANK meets all my requirements for a bank 0.29 ns 0.65 ns

Last year I had a pleasant relationship with BANK 0.82 ns 1.52 ns

BANK has met my expectations 0.07 ns 0.05 ns

I have regretted my choice for BANK * 0.02 ns 0.10 ns

Last year I had some problems with BANK * 1.42 ns 1.42 ns

Hypothesis 9 was: the satisfaction scores are not contaminated by quality. This

hypothesis was tested by means of a restricted factor analysis model using the nine items

reflecting satisfaction with the company and the quality scores (because the analyses using

quality scores yielded similar results as the analyses using TNP (Section 3), we reported the

results from the former analyses). The factor model was specified such that the nine items

reflecting satisfaction loaded on the factor reflecting satisfaction, and the quality score loaded

on the factor reflecting the violator. The factor model did not fit the data well (i.e., NNFI =

0.89, which is below the critical value of 0.90 for the NNFI), and the AMI of item Q4d (Last

year I had some problems with BANK; Table 4) was significant (Table 23; the complete

dataset). Similar results were found in the reduced dataset (Table 23; the reduced dataset).

Thus, item Q4d was significantly biased with respect to quality, and the hypothesis was not

supported.

A restricted factor analysis without item Q4d (i.e., the factor model was specified such

that the remaining eight items reflecting satisfaction loaded on the factor reflecting

satisfaction) yielded a good fit (i.e., GFI = 0.93, NFI = 0.93, and NNFI = 0.91; the complete

dataset), and none of the AMI’s was significant. Similar results were found in the reduced

dataset. Thus, the contamination of satisfaction scores by quality was due to item Q4d only.

Complete dataset (N=1689): AMI, p-value. Reduced dataset (N=1650): AMI, p-value.

Item AMI p-value AMI p-value

Last year I had some problems with BANK * 6.52 <0.05 6.87 <0.01

Hypothesis 10 was: the satisfaction scores are not contaminated by loyalty. This

hypothesis was tested by means of a restricted factor analysis model using the nine items

reflecting satisfaction with the company, and the loyalty score. The factor model was

specified such that the nine items reflecting satisfaction loaded on the factor reflecting

satisfaction, and the loyalty score loaded on the factor reflecting the violator. The factor

model did not fit the data well (i.e., NNFI = 0.88, which is below the critical value of 0.90 for

the NNFI), and the AMI of item Q3a (At BANK I feel at home; Table 4) was significant (Table

24; the complete dataset). Similar results were found in the reduced dataset (Table 24; the

reduced dataset). Thus, item Q3a was significantly biased with respect to customer loyalty,

and the hypothesis was not supported.

A restricted factor analysis without item Q3a (i.e., the factor model was specified such

that the remaining eight items reflecting satisfaction loaded on the factor reflecting

satisfaction) yielded a good fit (i.e., GFI = 0.93, NFI = 0.93, and NNFI = 0.91; the complete

dataset), and none of the AMI’s was significant. Similar results were found in the reduced

dataset. Thus, the contamination of satisfaction scores by customer loyalty was due to item

Q3a only.

Table 24 (AMI and p-value per dataset): Complete dataset (N=1686), Reduced dataset (N=1650)

At BANK I feel at home 10.73 <0.01 12.00 <0.01

Hypothesis 11 was: the satisfaction scores are not contaminated by current customer

profitability. This hypothesis was tested by means of a restricted factor analysis model using

the nine items reflecting satisfaction with the company, and TCP2005 (i.e., the logarithmic

transformation of CP2005; Section 2). The factor model was specified such that the nine items

reflecting satisfaction loaded on the factor reflecting satisfaction, and TCP2005 loaded on the

factor reflecting the violator. The factor model fitted the data well (i.e., GFI = 0.92, NFI =

0.92, and NNFI = 0.90), and none of the AMI’s was significant (Table 25; the complete

dataset). Similar results were found in the reduced dataset (Table 25; the reduced dataset).

Thus, none of the items reflecting satisfaction was significantly biased with respect to TCP2005,

and the hypothesis was supported.

Table 25: Complete dataset (N=1689), Reduced dataset (N=1650) (table body not recoverable)

Method-related irrelevant variance

Hypothesis 12 was: the satisfaction scores are not affected by the location of the satisfaction

items in the questionnaire. The hypothesis was tested by means of a t-test of the difference

between the average satisfaction score in the versions 1 and 2 of the pilot study, and the

average satisfaction score in the versions 3 and 4 of the pilot study (see Table 8 in Chapter 5;

note that the satisfaction score is the total score on the 9-item satisfaction scale). Because the

difference was not significant (Table 26), the hypothesis was supported.

Table 26: Differences of Satisfaction Scores in Groups of the Pilot Study

Groups          Difference   t-statistic   p-value

Hypothesis 12   -0.70        -1.19         ns

Hypothesis 13   -0.46        -0.79         ns

Hypothesis 13 was: the satisfaction scores are not affected by the presentation of the

response categories of the satisfaction items. The hypothesis was tested by means of a t-test

of the difference between the average satisfaction score in the versions 1 and 3 of the pilot

study, and the average satisfaction score in the versions 2 and 4 of the pilot study (see Table 8

in Chapter 5; note that the satisfaction score is the total score on the 9-item satisfaction scale).

Because the difference was not significant (Table 26), the hypothesis was supported.
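The group comparison behind hypotheses 12 and 13 can be illustrated with a minimal sketch of a two-sample t-statistic. The pooled-variance form is an assumption, because the text does not state which variant of the t-test was used. The scores and the function name `pooled_t` are hypothetical.

```python
import math
import statistics

def pooled_t(group_a, group_b):
    """Return (mean difference, t-statistic, degrees of freedom) for two
    independent groups, assuming equal variances (pooled estimate)."""
    n_a, n_b = len(group_a), len(group_b)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff, diff / se, n_a + n_b - 2

# Hypothetical total scores on the 9-item satisfaction scale (range 0-36).
versions_1_and_2 = [28, 31, 25, 30, 27, 33]
versions_3_and_4 = [29, 30, 27, 32, 28, 31]
difference, t_stat, df = pooled_t(versions_1_and_2, versions_3_and_4)
```

The t-statistic would then be compared with the critical value of the t distribution with `df` degrees of freedom.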

Hypothesis 14 was: the satisfaction scores are not affected by the midpoint response

style. The test of this hypothesis required the measurement of general midpoint responding

(e.g., Baumgartner & Steenkamp, 2001, 2006). Because the present study did not yield

suitable data to create a suitable measure of general midpoint responding, the hypothesis was

tested in the second empirical study (Chapter 8).

Hypothesis 15 was: the satisfaction scores are not affected by the extreme response

style. The test of this hypothesis required the measurement of general extreme responding

(e.g., Baumgartner & Steenkamp, 2001, 2006). Because the present study did not yield

suitable data to create a suitable measure of general extreme responding, the hypothesis was

tested in the second empirical study (Chapter 8).

Implicit construct representation

The hypotheses regarding implicit construct representation were tested last, because the

results of the tests of other hypotheses were used in the tests of the hypotheses regarding

implicit construct representation. First, the test of hypothesis 9 demonstrated that item Q4d

(Last year I had some problems with BANK; Table 4) was biased with respect to quality. The

use of this item in the satisfaction scale was expected to inflate the correlation between

customer satisfaction and quality. Therefore, we decided to exclude the item from the

satisfaction scale when testing the hypothesis regarding the relation between customer

satisfaction and quality. Second, the test of hypothesis 10 demonstrated that item Q3a (At

BANK I feel at home; Table 4) was biased with respect to customer loyalty. The use of this

item in the satisfaction scale was also expected to inflate the correlation between customer

satisfaction and customer loyalty. Therefore, we decided to exclude this item from the

satisfaction scale when testing the hypothesis regarding the relation between customer

satisfaction and customer loyalty. The hypotheses concerning the relation of satisfaction

scores to trust scores, quality scores, and loyalty scores, respectively, were tested by means of

correlation analyses.

Hypothesis 4 was: satisfaction scores are positively related to trust scores. This

hypothesis was tested using the total score on the customer satisfaction scale and the total

score on the trust scale. The product-moment correlation between customer satisfaction and

trust was positive and significant (p < 0.001) in both the complete dataset and the reduced

dataset (see Table 27). Thus, the hypothesis was supported.

Hypothesis 5 was: satisfaction scores are positively related to quality scores. In order

to test this hypothesis, item Q4d (Table 4) was excluded from the customer satisfaction scale.

Thus, the hypothesis was tested using the total score on the 8-item customer satisfaction scale

and the quality scores (because the analyses using quality scores yielded similar results to the

analyses using TNP (Section 3), except that the correlations in the former analyses were

positive and the correlations in the latter analyses were negative, we reported the results from

the former analyses). The product-moment correlation between customer satisfaction and

quality was positive and significant (p < 0.001) in both the complete dataset and the reduced

dataset (Table 27). This means that the fewer problems a participant had experienced with BANK, the

higher his or her satisfaction with BANK was. Thus, the hypothesis was supported. Because it

may also be interesting to examine the relations between the experience of singular problems

and customer satisfaction, these relations were also reported (Table 28). These relations were

negative because the experience of a problem is counter-indicative of quality.

Hypothesis 6 was: satisfaction scores are positively related to loyalty scores. In order to

test this hypothesis, item Q3a (Table 4) was excluded from the customer satisfaction scale.

Thus, the hypothesis was tested using the total score on the 8-item customer satisfaction scale

and the total score on the customer loyalty scale. The product-moment correlation between

customer satisfaction and customer loyalty was positive and significant (p < 0.001) in both the

complete dataset and the reduced dataset (see Table 27). Thus, the hypothesis was supported.

Table 27: Product-Moment Correlations Between Customer Satisfaction and Other Concepts

            Complete dataset (N = 1689)       Reduced dataset (N = 1650)

            r       95%-interval for ρ        r       95%-interval for ρ

Trust       0.78*   0.76 ≤ ρ ≤ 0.80           0.79*   0.77 ≤ ρ ≤ 0.81

Quality     0.47*   0.43 ≤ ρ ≤ 0.51           0.48*   0.44 ≤ ρ ≤ 0.52

Loyalty     0.51*   0.47 ≤ ρ ≤ 0.55           0.51*   0.47 ≤ ρ ≤ 0.55

* = p < 0.001
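The text does not state how the 95% intervals for ρ were obtained. A common approach, assumed here, is the Fisher z-transformation. The sketch below applies it to the correlation with trust in the complete dataset (r = 0.78, N = 1689); the function name `fisher_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def fisher_interval(r, n, level=0.95):
    """Confidence interval for a population correlation via Fisher's z.

    Assumes bivariate normality; the standard error of z is 1/sqrt(n - 3).
    """
    z = math.atanh(r)                       # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Back-transform the limits from the z scale to the correlation scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

lower, upper = fisher_interval(0.78, 1689)
```

Under these assumptions the limits round to 0.76 and 0.80.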

Table 28: Relations Between the Incidence of Singular Problems and Customer Satisfaction

Columns: Item; Complete dataset (N=1689): Proportion, Polychoric correlation; Reduced dataset (N=1650): Proportion, Polychoric correlation

Errors in the execution of your banking affairs 0.03 -0.33 0.03 -0.33

Errors in the execution of your orders 0.05 -0.28 0.05 -0.27

Insufficient information on your banking affairs 0.04 -0.44 0.05 -0.45

Ambiguous information on your banking affairs 0.06 -0.38 0.06 -0.38

Unfair costs of banking services 0.12 -0.40 0.12 -0.40

Slow service 0.06 -0.43 0.06 -0.45

Slow money transfers 0.16 -0.32 0.16 -0.33

Not keeping an appointment 0.03 -0.33 0.03 -0.32

Insufficient accessibility by telephone 0.05 -0.24 0.05 -0.24

Insufficient accessibility by internet 0.12 -0.24 0.12 -0.24

Insufficient accessibility of offices 0.09 -0.18 0.09 -0.18

Insufficient response to questions 0.06 -0.47 0.06 -0.47

Problems with debit cards 0.07 -0.20 0.07 -0.21

Problems with cash withdrawals 0.04 -0.09* 0.04 -0.10*

Problems with internet banking 0.14 -0.21 0.14 -0.22

Another problem 0.08 -0.29 0.08 -0.28

* = not significant at p <0.05

Hypothesis 7 was: satisfaction scores are positively related to future customer

profitability. In Chapter 3, the following model was suggested for the relation between

customer satisfaction (denoted CSt=0), other independent variables (denoted Xi), and future

customer profitability (denoted CPt>0):

$$CP_{t>0} = \alpha + \beta\,CS_{t=0} + \sum_i \gamma_i X_i + \varepsilon.$$

Because customer satisfaction was measured in September 2005, CSt=0 was operationalised as

customer satisfaction in September 2005 (denoted CS2005). Because it was expected that the

influence of customer satisfaction on CP manifested after one year (Section 5 from Chapter

3), CPt>0 was operationalised as CP in September 2006 (denoted CP2006). Furthermore,

because former studies indicated that current CP accounts for the largest part of future CP

(Section 5 from Chapter 3), Xi was operationalised as CP in September 2005 (denoted CP2005).

The preliminary analyses demonstrated that the distributions of CP2005 and CP2006 were

positively skewed and included many outliers in the skew tail (Section 2). Therefore, CP2005

and CP2006 were logarithmically transformed. The logarithmic transformation of CP2005 was

denoted TCP2005 and the logarithmic transformation of CP2006 was denoted TCP2006 (Section

2). Hypothesis 7 was tested by means of a regression analysis of TCP2006 on TCP2005 and

CS2005. TCP2006’ is the predicted value of TCP2006, a is the intercept, b1 is the effect of TCP2005

on TCP2006, and b2 is the effect of CS2005 on TCP2006. The regression model was:

TCP2006’ = a + b1TCP2005 + b2CS2005. (Model 1)

We used hierarchical regression analyses (e.g., Cohen & Cohen, 1983, pp. 120-122; Hays,

1988, pp. 662-665; Tabachnick & Fidell, 2007, pp. 138-147) to compute the contribution of

each predictor to the explanation of TCP2006. Because we expected that TCP2005 accounted for

the largest part of TCP2006, TCP2005 was entered first in the analyses and CS2005 was entered

second. Statistic Fseq expresses the significance of sequential entries of predictor variables for

the explanation of the criterion variable. Let RM denote the restricted model without the

predictor variable of interest, ERM the error sum of squares under the restricted model, dfRM

the degrees of freedom under the restricted model, FM the full model including the predictor,

EFM the error sum of squares under the full model, and dfFM the degrees of freedom under the

full model. Then statistic Fseq is defined as (Maxwell & Delaney, 1990, pp. 73-74):

$$F_{seq} = \frac{(E_{RM} - E_{FM}) / (df_{RM} - df_{FM})}{E_{FM} / df_{FM}},$$

which theoretically follows an F distribution with dfRM – dfFM and dfFM degrees of freedom.

Following Cohen and Cohen (1983, p.155), we also computed the effect size (denoted f2) for

sequential entries of predictor variables. Let $R^2_{RM}$ be the variance explained under the

restricted model and $R^2_{FM}$ the variance explained under the full model. Then effect size f2 is

defined as:

$$f^2 = \frac{R^2_{FM} - R^2_{RM}}{1 - R^2_{FM}}.$$
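As an illustration, the two statistics can be computed from the summary quantities of a restricted and a full model. This is a minimal sketch that assumes the standard definitions from the cited sources (Maxwell & Delaney for Fseq, Cohen & Cohen for f2). The function names and all numbers are hypothetical and are not results of the study.

```python
def f_sequential(e_rm, df_rm, e_fm, df_fm):
    """Sequential F: reduction in error sum of squares per extra parameter,
    divided by the error mean square of the full model."""
    return ((e_rm - e_fm) / (df_rm - df_fm)) / (e_fm / df_fm)

def effect_size_f2(r2_rm, r2_fm):
    """Effect size f2: gain in explained variance relative to the
    variance left unexplained by the full model."""
    return (r2_fm - r2_rm) / (1.0 - r2_fm)

# Hypothetical example: total sum of squares 1000; the restricted model leaves
# an error sum of squares of 160 (R2 = 0.84), the full model 150 (R2 = 0.85).
f_seq = f_sequential(e_rm=160.0, df_rm=1680, e_fm=150.0, df_fm=1679)
f2 = effect_size_f2(r2_rm=0.84, r2_fm=0.85)
```

When both models are fitted to the same criterion variable, the two statistics are linked by Fseq = f2 × dfFM / (dfRM − dfFM), which provides a quick consistency check on reported values.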

The regression analyses were done using proc reg (SAS STAT). To assess the

robustness of the results, the full model was tested with and without weighting of participants

(Section 2), and with and without outliers (i.e., the complete dataset and the reduced dataset,

respectively). Thus, four regression analyses were done; the first analysis was in the complete

dataset without weighting of participants, the second analysis in the complete dataset with

weighting of participants, the third analysis in the reduced dataset without weighting of

participants, and the fourth analysis in the reduced dataset with weighting of participants.

Seven participants were excluded from the analyses because they had deceased since

September 2005.

The results from the regression analyses are presented in Table 29. The major statistics

reported are R2, which represents the cumulative proportion of the variance explained after

including a new predictor in the analysis; f2, which represents the effect size of each new

predictor entered in the analysis; Fseq, which represents the significance of each new predictor

for the explanation of TCP2006; and SRW, which represents the standardised regression weight

(e.g., Hays, 1988, pp. 623-625) of each predictor. Because we reported the standardised

solution, intercept a was equal to zero and not reported in Table 29.

Each analysis demonstrated a significant contribution of CS2005 to the explanation of

TCP2006, when TCP2005 was accounted for (Fseq in Table 29). Furthermore, each analysis

yielded a positive effect for CS2005 on TCP2006 (SRW in Table 29). The similarity of the results

from the analyses demonstrates their robustness. Thus, hypothesis 7 was supported by the

results of the analyses.

The percentage explained variance of TCP2006 was 84% or more (R2 in Table 29) across

analyses. This result was almost completely due to TCP2005, which also had a large effect size

(f2 in Table 29) in each analysis. Thus, current TCP was the main predictor of future TCP.

This result is in line with the results from former customer profitability analyses in the

financial services industry (e.g., Campbell & Frei, 2004; Terpstra, 2005, 2006b).

[Table 29: results of the four hierarchical regression analyses of TCP2006 on TCP2005 and CS2005 (R2, f2, Fseq, and SRW per predictor); the table body could not be recovered.]

5 Relation between customer satisfaction and future CP with a time-lag of two years

The test of hypothesis 7 demonstrated that customer satisfaction was positively related to

future CP. It is unknown how a time lag larger than one year between measurements of

customer satisfaction and future CP affects the relation between customer satisfaction and

future CP. This warrants further research into the relation between customer satisfaction and

future CP. We investigated the relationship of customer satisfaction and future CP on

available data pertaining to a two-year time-lag.

Method

Because CP2005 and CP2007 were skewed and included many outliers, we applied a logarithmic

transformation to CP2005 and CP2007 (Section 2). The logarithmically transformed CP2005 was

denoted TCP2005 and the logarithmically transformed CP2007 was denoted TCP2007 (Section 2).

We regressed TCP2007 on TCP2005 and CS2005. TCP2007’ is the predicted value of TCP2007, a is

the intercept, b1 is the effect of TCP2005 on TCP2007, and b2 is the effect of CS2005 on TCP2007.

The regression model was:

TCP2007’ = a + b1TCP2005 + b2CS2005. (Model 2)

We used hierarchical regression analyses (e.g., Cohen & Cohen, 1983, pp. 120-122; Hays,

1988, pp. 662-665; Tabachnick & Fidell, 2007, pp. 138-147) to compute the contribution of

each predictor to the explanation of TCP2007. Because we expected that TCP2005 accounted for

the largest part of TCP2007, TCP2005 was entered first in the analyses and CS2005 was entered

second. In order to explore the robustness of the results, we estimated the model with and

without weighting of participants, and with and without outliers. Thus, we did four regression

analyses.

Results

The results are reported in Table 30. Because we reported the standardised solutions, intercept

a was equal to zero and not reported in Table 30. Each analysis demonstrated a significant

contribution of CS2005 to the explanation of TCP2007, when TCP2005 was accounted for.

Furthermore, each analysis yielded a positive effect for CS2005. The similarity of the results

from the analyses demonstrates their robustness. Thus, there is evidence of a relation between

customer satisfaction and future TCP, when future TCP is measured with a time lag of two

years.

[Table 30: results of the four hierarchical regression analyses of TCP2007 on TCP2005 and CS2005; the table body could not be recovered.]

The computation of the predicted values for TCP2007 on the basis of the unstandardised

solutions (not shown here), and the exponential transformation of the predicted values for

TCP2007 to predicted values for CP2007, demonstrated that the impact of CS2005 on the

predicted value for CP2007 was dependent on the value for TCP2005. For customers having a

small value for TCP2005, the score for CS2005 had almost no impact on the predicted value for

CP2007, while for customers having a large value for TCP2005, the score for CS2005 had a

substantial impact on the predicted value for CP2007. This result may be due to using

logarithmically transformed values for CP2005 and CP2007 in the regression analyses, but we

consider it a plausible result which is in agreement with the opinion in marketing that it is

important to keep profitable customers satisfied.
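Why the back-transformation makes the impact of CS2005 depend on TCP2005 can be seen in a small sketch. It assumes that TCP is the natural logarithm of CP, so that the predicted CP2007 equals exp(a + b1·TCP2005 + b2·CS2005). The exact transformation used in the study is described in Section 2 and may differ. The coefficient values and function names are hypothetical.

```python
import math

def predicted_cp2007(tcp2005, cs2005, a=0.5, b1=0.85, b2=0.01):
    """Back-transform the predicted TCP2007 to the CP scale.

    Assumes TCP = ln(CP); a, b1, and b2 are hypothetical unstandardised weights.
    """
    return math.exp(a + b1 * tcp2005 + b2 * cs2005)

def cs_impact(tcp2005, cs_low=18, cs_high=36):
    """Difference in predicted CP2007 between a high and a low satisfaction score."""
    return predicted_cp2007(tcp2005, cs_high) - predicted_cp2007(tcp2005, cs_low)

impact_small_tcp = cs_impact(tcp2005=2.0)   # customer with a small TCP2005
impact_large_tcp = cs_impact(tcp2005=7.0)   # customer with a large TCP2005
```

In such a model the effect of CS2005 is multiplicative on the CP scale: the ratio of predicted values for two satisfaction scores is the same for every customer, so the absolute difference grows with TCP2005.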

6 Discussion

The first empirical study demonstrated that the set of nine items reflecting customer
satisfaction constituted a strong (H ≥ .5) scale according to the MH model. Furthermore, the
study demonstrated several strengths and weaknesses of the measurement instrument for
customer satisfaction and the corresponding scale scores. A first strength is that the
measurement instrument is based on explicit and implicit definitions of customer satisfaction.
All aspects of customer satisfaction were evenly represented in the instrument, and this supports
the claim that the scale scores cover the meaning of customer satisfaction well. A second
strength is the fit of the measurement model. The tests of the model yielded no substantial
violations of the MH model, which supports the use of the scale scores to measure customer
satisfaction. Because the measurement instrument was composed of items that were indicative
of customer satisfaction and items that were counter-indicative of the construct, the fit of the
measurement model also confirms the conception of customer satisfaction as the opposite of
customer dissatisfaction on a bipolar dimension. A third strength is the fit of the measurement
model in the subgroups based on customer segment, age, and gender. This supports the
generalisability of the scale across subgroups in the target population. A fourth strength is that
the inclusion of items that are indicative and items that are counter-indicative of customer
satisfaction in the measurement instrument seems to limit the effects of acquiescent responding
on the scale scores (e.g., Baumgartner & Steenkamp, 2001, 2006; Van Herk, 2000, p. 55). A
fifth strength is that the scale is composed of a large number of items, which limited the
effect of a single biased item on the scale score. Lack of bias also supports the
confidence in the validity of the scale-score interpretations.

The major weakness of the scale scores was their divergent validity. The tests of the

hypotheses regarding concept-related irrelevant variance revealed that the customer

satisfaction scores were contaminated by quality and customer loyalty. This was due to the

items Q3a (At BANK I feel at home; Table 4) and Q4d (Last year I had some problems with

BANK; Table 4). For this reason, the scale had to be modified for research into the

connections of customer satisfaction with these constructs. A point of concern was the presence of

outliers in the left tail of the distribution of the customer satisfaction scores. It cannot be

ruled out that the outliers were due to stylistic responding.

The analyses into relations between customer satisfaction and future CP with a time lag

of two years yielded some important results. It was demonstrated that the influence of

customer satisfaction on customer profitability lasts for at least two years. This warrants

further research into the effect of customer satisfaction on the cumulated customer

profitability. Furthermore, a comparison of the results of the analyses predicting future CP

with a time lag of one year (Table 29) and the analyses predicting future CP with a time lag of

two years (Table 30) reveals that the influence of current CP on future CP decreases when the

time lag between the measurements of current CP and future CP increases. This decay

implies that, in the long run, companies cannot take the future CP of existing customers for

granted. It also implies that it may be dangerous to estimate customer lifetime value by solely

using current CP. Based on this research, not only current CP should be used for the

estimation of customer lifetime value, but for example also customer satisfaction and

customer loyalty.

Six additional remarks are in order. First, the items indicative of customer satisfaction

were all negatively skewed, and the items counter-indicative of customer satisfaction were all

positively skewed. This is in agreement with the results found in various satisfaction studies

in various domains (e.g., Oliver, 1997; Peterson & Wilson, 1992), and suggests that being

satisfied is more or less the default satisfaction state of most persons. Second, the correlation

between customer satisfaction and trust was found to be very high, and matched the

correlation between customer satisfaction and the score on the ACSI. This indicates that there

is a large overlap between the construct of customer satisfaction and the construct of trust in

the context of retail banking. Third, current customer profitability had a large effect on future

customer profitability. Therefore we recommend including current customer profitability as a

predictor in regression models of future customer profitability in the financial services

industry (see also Donkers, Verhoef, & De Jong, 2007). Fourth, the results of the analyses in

the complete dataset and the reduced dataset were very similar. Thus, the outliers on the

items reflecting customer satisfaction, trust, customer loyalty, and interest did not influence

the results of the data analyses substantially. Fifth, the effect sizes for customer satisfaction on

future CP were small. This may be due to the omission of important predictors, such as the

total financial means of a customer (Chapter 3), in the regression analyses (e.g., Hays, 1988,

p. 655). Therefore we suggest including measurements of the total financial means of

customers in future research into the influence of customer satisfaction on future CP. Sixth,

the generalisability of the results of the study into the relation between customer satisfaction

and future CP has to be investigated. The sample was drawn from the research panel of the

company, and it cannot be ruled out that persons who were willing to participate in the panel

have a different attitude towards banking than persons who were not willing to participate in

the panel, and that the attitude towards banking influences the relation between customer

satisfaction and future CP. Therefore, we advocate research into the generalisability of the

results of the present study to other groups and companies within the financial services

industry.

7 Conclusion

So far, the results of the first empirical study yielded much evidence for construct validity,

meaning that the results warrant the interpretation of the scale scores in terms of satisfaction

with the company. However, the validation study was not completed because two hypotheses

regarding the contamination of scale scores by method-related irrelevant variance were not

tested. These hypotheses were tested in the second empirical study (Chapter 8). Because the

test of these hypotheses yielded further information about the meaning of the scale scores in

the first empirical study, we prefer to present the final conclusions about the validity of

measurement after the presentation of the results of the second empirical study.

Chapter 7

Method of the second empirical study into customer satisfaction with BANK

1 Introduction

The purpose of the second empirical study into customer satisfaction with BANK was to test

hypothesis 14 (i.e., the satisfaction scores are not affected by the midpoint response style) and

hypothesis 15 (i.e., the satisfaction scores are not affected by the extreme response style).

Testing these hypotheses required the measurement of (a) customer satisfaction, (b) general

midpoint responding, and (c) general extreme responding. We decided to operationalise

customer satisfaction on the basis of the 9-item measurement instrument (see Chapter 5),

because it was our purpose to combine the conclusions of the second empirical study with

those of the first empirical study. Furthermore, we decided to operationalise general midpoint

responding as a participant’s proportion of responses in the middle response category of

rating scales of items, and general extreme responding as a participant’s proportion of

responses in the extreme response categories of rating scales of items (Chapter 8).

Greenleaf (1992b), Van Herk (2000), and Baumgartner and Steenkamp (2001, 2006)

noted that measures of general midpoint responding and general extreme responding have to

be based on persons’ responses to many items with low inter-item correlations. This is in

agreement with Paulhus’ (1991, p. 49) remark that persons exhibiting consistent extreme

response behaviour across time and stimuli may be said to have an extreme response style.

For this reason, Greenleaf (1992b) and Van Herk (2000) operationalised extreme response

style as a participant’s proportion of responses in the extreme response categories of rating

scales of various items. Generalising Paulhus’ (1991, p. 49) remark to midpoint responding,

persons exhibiting a consistent midpoint response behaviour across time and stimuli may be

said to have a midpoint response style. The midpoint response style may be operationalised as

a participant’s proportion of responses in the middle response category of rating scales of

various items.
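The two proportion-based operationalisations can be sketched as follows for items scored 0 to 4, the scoring used for all items in this study. Excluding "no answer" responses (represented here as `None`) from the denominator is an assumption, and the function name and example data are hypothetical.

```python
def response_style_proportions(responses, midpoint=2, extremes=(0, 4)):
    """Return (midpoint proportion, extreme proportion) for one participant.

    `responses` holds item scores on a 0-4 scale; None marks "no answer"
    and is left out of the denominator (an assumption of this sketch).
    """
    valid = [r for r in responses if r is not None]
    if not valid:
        return None, None  # no usable responses for this participant
    mid = sum(r == midpoint for r in valid) / len(valid)
    ext = sum(r in extremes for r in valid) / len(valid)
    return mid, ext

# Hypothetical participant: six items, one left unanswered.
mid_prop, ext_prop = response_style_proportions([0, 2, 2, 4, 1, None])
```

For this hypothetical participant, two of the five valid responses fall in the middle category and two in the extreme categories.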

Dependence of the operationalisations of response styles on operationalisations of the

construct of interest would complicate research into the contamination of measurements of the

construct of interest by response styles (Oort, 1996, pp. 13-14). For example, assume that the

measurement of general extreme responding was done on the basis of items reflecting the

construct of interest. Then a high score on general extreme responding can be achieved by

answering positively to the items indicative and negatively to the items counter-indicative of

the construct of interest. In that instance, a high measurement value for general extreme

responding might reflect a high preference for extreme responding as well as a high value on

the construct of interest, and these two possibilities cannot be distinguished. To prevent

measurements of general midpoint responding and general extreme responding from partly reflecting

customer satisfaction, the items used for these measurements had to be unrelated to

customer satisfaction. For this reason, we decided to measure four constructs, which we

expected to be unrelated to customer satisfaction, and to use the items reflecting these

constructs to compose the measures for stylistic responding. The constructs were (a)

expectations with respect to personal spending power, (b) expectations with respect to the

Dutch economy, (c) involvement with banking matters, and (d) understanding of the Dutch

banking market. Because the response format of items may affect stylistic responding (Van

Herk, 2000, p. 59), we used identical response formats for all items used in the study.

The second empirical study was conducted in August 2007, which was approximately

two years after the first empirical study. This chapter discusses the method used in the second

empirical study. It encompasses an outline of the operationalisations of the constructs, the

questionnaire, the target population, the sample, the procedure, and the data.

2 Operationalisations

The design of the questionnaire, the format of the items, and the wording of the items were

based upon general principles concerning survey research as formulated by Sudman and

Bradburn (1982), Sheatsley (1983), Belson (1986), and Dillman et al. (1998). All items used

were 5-point rating scale items. Similar to the first empirical study, we included a no answer

option in the response options of the items, and we varied the ordering of items within the

groups of items reflecting a construct, across different administrations of the questionnaire.

The operationalisations of the five constructs were the following.

Customer satisfaction was operationalised by means of nine Likert items with five ordered

response categories each, ranging from totally agree (which was scored 4) to totally disagree

(which was scored 0) (Chapter 5; Table 1). Also in the sample used in the second study, we

expected the nine items to constitute a scale according to the MH model.

Expectations with respect to personal spending power

The customers’ positive expectations with respect to personal spending power (EPSP) were

measured using two items reflecting this concept (Table 1). Each item had five ordered

response categories that ranged from totally agree (which was scored 4) to totally disagree

(which was scored 0). We expected the two items to be negatively correlated.

Table 1: Items Reflecting Expectations Regarding Personal Spending Power

Code Item Score range

Q6a I expect that my spending power will increase next year 0 - 4

Q6d* In five years my spending power will be lower than today 0 - 4

*= item is counter-indicative of the concept

Expectations with respect to the Dutch economy

The customers’ positive expectations with respect to the Dutch economy (EDE) were

measured using two items reflecting this concept (Table 2). Each item had five ordered

response categories that ranged from totally agree (which was scored 4) to totally disagree

(which was scored 0). We expected the two items to be negatively correlated.

Table 2: Items Reflecting Expectations Regarding the Dutch Economy

Code Item Score range

Q7b* I expect that the Dutch economy will decrease next year 0 - 4

Q7c In five years, the Dutch economy will be better than today 0 - 4

Involvement with banking matters

The customers’ involvement with banking matters (labeled involvement) was measured using

four items reflecting this concept (Table 3). Each item had five ordered response categories

that ranged from totally agree (which was scored 4) to totally disagree (which was scored 0).

We expected the four items to be positively correlated after having been scored in the same

direction.

Table 3: Items Reflecting Involvement With Banking Matters

Code Item Score range

Q8b I find banking matters very important 0 - 4

Q8c Arranging banking matters properly makes life easier 0 - 4

Q8d* I find banking matters boring 0 - 4

Q8e* Banking matters leave me cold 0 - 4
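Scoring items "in the same direction" amounts to reversing the counter-indicative items (marked with an asterisk) on the 0 to 4 scale. The sketch below illustrates this for the four involvement items. The helper name, the example responses, and the use of a total score are illustrative assumptions.

```python
MAX_SCORE = 4                        # items are scored 0 (totally disagree) to 4 (totally agree)
COUNTER_INDICATIVE = {"Q8d", "Q8e"}  # items marked with an asterisk in Table 3

def align_scores(raw_scores):
    """Reverse counter-indicative items so that a higher score always
    indicates more involvement."""
    return {
        code: (MAX_SCORE - score if code in COUNTER_INDICATIVE else score)
        for code, score in raw_scores.items()
    }

# Hypothetical responses of one participant.
raw = {"Q8b": 4, "Q8c": 3, "Q8d": 1, "Q8e": 0}
aligned = align_scores(raw)
involvement_total = sum(aligned.values())
```

After reversal, strong disagreement with "Banking matters leave me cold" contributes the maximum score of 4 to the hypothetical total.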

Understanding of the Dutch banking market

The customers’ understanding of the Dutch banking market (labeled understanding) was

measured using four items reflecting this concept (Table 4). Each item had five ordered

response categories that ranged from totally agree (which was scored as 4) to totally disagree

(which was scored as 0). We expected the four items to be positively correlated after the

correct scoring.

Table 4: Items Reflecting Understanding of the Dutch Banking Market

Code   Item                                                               Score range
Q9a    I know the pros and cons of the retail banks in the Netherlands    0 - 4
Q9b*   I find it difficult to judge the quality of BANK                   0 - 4
Q9c*   I find it difficult to compare the quality of retail banks         0 - 4
Q9d    I know exactly what I may expect from BANK                         0 - 4

3 The questionnaire

The questionnaire (Appendix 3; in Dutch) was composed of the items reflecting customer

satisfaction, EPSP, EDE, involvement, and understanding. In addition, some items were

included in the questionnaire for business purposes, and some other items were included to

optimise the design of the questionnaire. For example, several items regarding product

possession and contacts with the company were included in order to elicit the participant’s

memories of the company, before the measurement of satisfaction with the company started.

The questionnaire was pre-tested in a small sample (N = 3). The pre-tests demonstrated that it

took a participant approximately 15 minutes to complete the questionnaire, which we

considered acceptable.

4 Procedure

The survey was administered via the Internet to the members of the company’s research

panel. The comparability of the target population (i.e., mature retail customers of a Dutch

bank), the research panel, and the final sample is discussed shortly. Panel members were

invited by E-mail to participate in the survey. The questionnaire was made available at a site

of the market research agency that managed the survey. The questionnaire was accessible

from 24 August 2007 until 3 September 2007. The persons had access to the site on the basis

of a password and were identified on the basis of a customer-id. After a person completed the

questionnaire, the data were uploaded to the agency. The participants received a small

incentive (i.e., saving points valued 10 euro). This is the common fee that the company paid

to panel members that responded to a survey of medium length.

5 Data

The raw data file contained the responses of the participants to the survey items (note that a "no answer" response was scored as a missing

value). In order to enrich the raw data, the file was merged with the marketing database. The

merging was executed on the basis of customer-id, and it was successful for all participants.

Subsequently, three variables were added to the file, (a) customer segment ultimo September

2007, (b) gender, and (c) age ultimo September 2007.

6 Target population, panel, and sample

Similar to the first empirical study, the target population consisted of the mature retail

customers of a Dutch bank. The participants were registered by the company as the primary

owner of at least one banking product provided by the company.

A total of 2972 persons were invited to participate in the survey. They were mature

retail customers who, in August 2007, participated in the company’s research panel. The

panel members had agreed to participate in marketing research via the Internet. The

agreement encompassed that (a) the company is free to approach the person for marketing

research, (b) the person is free to participate in the research or to decline, (c) the company is

allowed to use the survey data for research purposes only, and (d) the company is not allowed

to distribute any personalised data to third parties. All panel members could be approached by

E-mail, and had a unique customer-id that was used for identification purposes.

The research panel differed in three ways from the target population. First, because the

company’s most valuable customers were overrepresented in the research panel, the panel

differed significantly (χ2(2) = 1244, p < 0.001) from the target population with respect to the

distribution of customer segment (Table 5). Second, the panel differed significantly (χ2(2) =

212, p < 0.001) from the target population with respect to the distribution of gender. Males

were overrepresented in the panel (see Table 5). This was partly due to the overrepresentation

of males among the segment Top Customers (i.e., the segment that was overrepresented in the

research panel), and partly to unknown causes. Third, the panel differed significantly (χ2(2) =

191, p < 0.001) from the target population with respect to the distribution of age group (Table 5).

The response rate in the study was approximately 41% (N = 1227). Table 5 shows the

distributions of customer segment, gender, and age group within the company, the panel, and

the research sample. The research sample differed significantly from the target population

with respect to customer segment (χ2(2) = 710, p < 0.001), gender (χ2(2) = 144, p < 0.001),

and age group (χ2(2) = 110, p < 0.001). Furthermore, the research sample differed

significantly from the remainder of the panel with respect to customer segment (χ2(2) = 30, p

< 0.001), gender (χ2(2) = 14, p < 0.001), and age group (χ2(2) = 22, p < 0.001). Thus,

respondents differed significantly from non-respondents with respect to customer segment,

gender, and age group. This was in line with the first empirical study (see Chapter 5).
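The comparison of the sample with the target population can be illustrated with a small sketch. It assumes that the reported statistics are Pearson goodness-of-fit chi-squares of the sample counts against the company distribution, which the text does not state explicitly; the function name `chi_square_gof` is hypothetical. Because the input consists of the rounded percentages from Table 5, the result only approximates the reported value of 710 for customer segment.

```python
def chi_square_gof(observed_pct, expected_pct, n):
    """Pearson chi-square statistic for observed versus expected percentages.

    Assumption: a one-sample goodness-of-fit test, with expected counts
    derived from the population (company) percentages.
    """
    stat = 0.0
    for obs, exp in zip(observed_pct, expected_pct):
        observed_count = n * obs / 100.0
        expected_count = n * exp / 100.0
        stat += (observed_count - expected_count) ** 2 / expected_count
    return stat


# Customer segment percentages from Table 5 (Top, Standard, Development).
company = [34, 41, 25]
sample = [70, 22, 8]

chi2 = chi_square_gof(sample, company, n=1227)
degrees_of_freedom = len(company) - 1  # three categories give 2 df

print(round(chi2), degrees_of_freedom)
```

The statistic obtained from the rounded percentages lies close to, but not exactly at, the reported value, which is what one would expect if the original analysis used unrounded counts.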

Table 5: Distribution (Percentages) of Customer Segment, Gender and Age Group in the Study

                     Company   Panel   Sample
Customer segment
  Top                     34      64       70
  Standard                41      25       22
  Development             25      11        8
Gender
  Female                  44      33       29
  Male                    52      65       69
  Unknown                  4       2        2
Age group
  18 to 39 years          34      26       22
  40 to 59 years          37      49       50

Chapter 8

Results of the second empirical study into customer satisfaction with BANK

1 Introduction

This chapter presents and discusses the results of the second empirical study in which

hypothesis 14 (i.e., the customer satisfaction scores are not affected by the midpoint response

style) and hypothesis 15 (i.e., the customer satisfaction scores are not affected by the extreme

response style) were investigated. First, we discuss preliminary analyses of which the purpose

was to prepare the data for the subsequent analyses. Second, we discuss measurement

analyses, which aimed at checking whether the MH model fitted the items for customer

satisfaction and at constructing scales for stylistic responding. Third, we discuss the results of

the tests of hypotheses 14 and 15. Fourth, we discuss the generalisability of the results. Fifth,

based on both empirical studies, we discuss the conclusions regarding the validity of

2 Preliminary data analyses

The dataset containing the raw data was converted into a SAS dataset, and the items that were

assumed to be counter-indicative of the constructs (see the description of the measurement

instruments in Chapter 7) were recoded in the opposite direction. Furthermore, we (a)

examined the distribution characteristics of the variables in the dataset, (b) explored the data

quality, (c) conducted missing data analyses, and (d) conducted outlier analyses.
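The recoding step can be sketched as follows. This is a minimal illustration, not the SAS code used in the study: it assumes that reversing a 0 to 4 item amounts to computing 4 minus the score, and that missing values are left untouched. The set of counter-indicative items is taken from the asterisks in Table 1 below; the function name `recode` is hypothetical.

```python
MAX_SCORE = 4  # items are scored 0 (totally disagree) to 4 (totally agree)

# Counter-indicative items, marked with * in Table 1.
COUNTER_INDICATIVE = {
    "Q3d", "Q3e", "Q4c", "Q4d",  # customer satisfaction
    "Q6d",                       # EPSP
    "Q7b",                       # EDE
    "Q8d", "Q8e",                # involvement
    "Q9b", "Q9c",                # understanding
}


def recode(responses):
    """Return a copy of the responses with counter-indicative items reversed.

    `responses` maps item codes to scores 0..4, or to None for a missing value.
    """
    recoded = {}
    for item, score in responses.items():
        if score is None:
            recoded[item] = None               # missing values stay missing
        elif item in COUNTER_INDICATIVE:
            recoded[item] = MAX_SCORE - score  # 0<->4, 1<->3, 2 stays 2
        else:
            recoded[item] = score
    return recoded
```

After recoding, a high score on every item points in the direction of the construct, so that item scores can be summed into a scale score.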

To examine the distribution characteristics of the variables, we computed the

histograms and descriptive statistics of all variables in the dataset. For this purpose, proc

univariate (SAS STAT) and proc means (SAS STAT) were used. The histograms (not shown

here) demonstrated that all variables were single peaked, and that many were negatively

skewed. This finding was corroborated by descriptive statistics (Table 1).

The correlations between the items reflecting customer satisfaction with the retail bank

and the items reflecting expectations regarding personal spending power (EPSP) were

examined. For this purpose, proc corr (SAS STAT) was used. Following our expectations, (a)

the items reflecting customer satisfaction were highly correlated, (b) the items reflecting

EPSP were highly correlated, and (c) the items reflecting customer satisfaction and the items

reflecting EPSP were almost uncorrelated (Table 2). This result suggested that participants did

not respond randomly but instead responded to the items’ content.

Because it was required that the items reflecting EPSP, expectations regarding the

Dutch economy (EDE), involvement with banking matters (involvement), and understanding

of the Dutch banking market (understanding) were unrelated to customer satisfaction, the

correlations between the items reflecting these constructs and the items reflecting customer

satisfaction were computed. Table 3 shows that two items reflecting understanding (i.e., Q9b:

I find it difficult to judge the quality of BANK and Q9d: I know exactly what I may expect from

BANK; Table 1) correlated substantially with the items reflecting customer satisfaction. The

other items reflecting understanding, and the items reflecting EPSP, EDE, and involvement

were almost uncorrelated with the items reflecting customer satisfaction. This result

strengthened our confidence in the usefulness of the data for the purpose of the second

empirical study, which was the testing of hypotheses 14 and 15.

The items reflecting customer satisfaction, EPSP, EDE, involvement, and

understanding showed few missing data (i.e., 5% or less; see Table 1). Thus, following the

strategy explained in Chapter 6, item scores were imputed by means of method TW-E

(Bernaards & Sijtsma, 2000; Van Ginkel et al., 2007). As expected, the descriptive statistics

of the items before imputation were almost identical to the descriptive statistics of the items

after imputation. Some participants (N = 41) left more than 50 percent of the items reflecting

customer satisfaction, EPSP, EDE, involvement, or understanding unanswered (Table 4).

These participants were considered outliers, and we created indicator variables to mark them

in the dataset.
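The missing-data rule described above can be sketched as follows. The sketch assumes that the 50 percent threshold is applied per construct, as Table 4 suggests, and that a missing response is represented by `None`; the names `CONSTRUCTS`, `mostly_unanswered`, and `is_missing_data_outlier` are hypothetical.

```python
# Items per construct, following Table 1.
CONSTRUCTS = {
    "satisfaction": ["Q3a", "Q3b", "Q3d", "Q3e", "Q3g", "Q4a", "Q4b", "Q4c", "Q4d"],
    "EPSP": ["Q6a", "Q6d"],
    "EDE": ["Q7b", "Q7c"],
    "involvement": ["Q8b", "Q8c", "Q8d", "Q8e"],
    "understanding": ["Q9a", "Q9b", "Q9c", "Q9d"],
}


def mostly_unanswered(responses, items):
    """True if more than half of the given items are missing (None or absent)."""
    missing = sum(1 for item in items if responses.get(item) is None)
    return missing > len(items) / 2


def is_missing_data_outlier(responses):
    """True if more than half of the items of at least one construct are missing."""
    return any(mostly_unanswered(responses, items) for items in CONSTRUCTS.values())
```

Under this reading, a two-item construct such as EPSP only triggers the flag when both items are unanswered, because one missing item is exactly, not more than, 50 percent.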

To detect multivariate outliers, the leverage statistic (see Chapter 6) was computed by

means of a regression analysis using customer-id as the criterion variable, and 21 items

reflecting customer satisfaction, EPSP, EDE, involvement, and understanding as the predictor

variables (Tabachnick & Fidell, 2007, pp. 75-76, 111-112). The analysis yielded 60

participants with a significant (p < 0.001) leverage value. Visual inspection of the data

revealed that these participants tended to give extremely positive or extremely negative

responses to the items. Furthermore, the inspection demonstrated that the two participants

with the highest leverage value had alternated extremely positive and extremely negative

responses to different items having similar content.

It was suspected that the two participants with the highest leverage value had responded

inconsistently to the items. An indicator variable was created to mark them in the dataset. This

variable was joined with the variables marking the participants who left the majority of items

reflecting a particular construct unanswered (Table 4). The union of these variables marked

43 outliers in the dataset. In agreement with the first empirical study, the results from analyses

with outliers and analyses without outliers were examined. Henceforth, the dataset including

these outliers is labeled the complete dataset, and the dataset without these outliers is labeled

the reduced dataset.

Table 1: Descriptive Statistics of Items Reflecting Customer Satisfaction, EPSP, EDE, Involvement, and Understanding (Before Imputation; N = 1227)

Code  Label                                                         Nmiss   Mean     SD   Skewness

Customer satisfaction items
Q3a   At BANK I feel at home                                            1   2.88   0.81   -0.82
Q3b   I am satisfied with BANK                                          0   2.86   0.81   -1.12
Q3d*  There are good reasons to leave BANK                              7   2.93   1.05   -0.85
Q3e*  I have mixed feelings about BANK                                  6   2.73   1.07   -0.62
Q3g   BANK meets all my requirements for a bank                         3   2.62   0.95   -0.85
Q4a   Last year I had a pleasant relationship with BANK                 4   2.75   0.80   -0.74
Q4b   BANK has met my expectations                                      1   2.69   0.90   -1.02
Q4c*  I have regretted my choice for BANK                               8   3.21   0.84   -1.09
Q4d*  Last year I had some problems with BANK                           8   2.91   1.05   -0.89

EPSP items
Q6a   I expect that my spending power will increase next year          28   1.90   0.89   -0.07
Q6d*  In five years my spending power will be lower than today         27   2.17   1.00   -0.14

EDE items
Q7b*  I expect that the Dutch economy will decrease next year          22   2.26   0.83   -0.33
Q7c   In five years, the Dutch economy will be better than today       28   2.09   0.79   -0.16

Involvement items
Q8b   I find banking matters very important                             0   2.74   0.75   -0.75
Q8c   Arranging banking matters properly makes life easier              5   2.96   0.57   -1.04
Q8d*  I find banking matters boring                                     0   2.58   0.89   -0.43
Q8e*  Banking matters leave me cold                                     3   2.99   0.77   -0.79

Understanding items
Q9a   I know the pros and cons of the retail banks in the Netherl.     21   1.89   0.88    0.01
Q9b*  I find it difficult to judge the quality of BANK                  3   2.43   0.87   -0.44
Q9c*  I find it difficult to compare the quality of retail banks        8   1.71   0.95    0.39
Q9d   I know exactly what I may expect from BANK                        5   2.54   0.77   -0.75

* = scored reversely

Table 2: Correlations Between 2 Items Reflecting Customer Satisfaction and 2 Items Reflecting EPSP

                                                                  Q3a    Q3b    Q6a    Q6d*
At BANK I feel at home                                    Q3a            0.74  -0.04   0.03
I am satisfied with BANK                                  Q3b                  -0.05   0.02
I expect that my spending power will increase next year   Q6a                          0.62
In five years my spending power will be lower than today  Q6d*

* = scored reversely

Table 3: Correlations Between Items Reflecting Customer Satisfaction (Columns) and Items Reflecting Other Constructs (Rows)

        Q3a    Q3b    Q3d    Q3e    Q3g    Q4a    Q4b    Q4c    Q4d
Q6a   -0.04  -0.05  -0.07  -0.04  -0.03   0.03  -0.02  -0.03   0.00
Q6d    0.04   0.02   0.00   0.01   0.03   0.08   0.04   0.04   0.05
Q7b    0.06   0.05   0.08   0.09   0.03   0.07   0.06   0.05   0.05
Q7c    0.04   0.03   0.03   0.06   0.03   0.04   0.04   0.02   0.03
Q8b    0.08   0.01   0.03  -0.01  -0.02   0.04  -0.02   0.04  -0.03
Q8c    0.12   0.08   0.09   0.06   0.10   0.11   0.08   0.13   0.03
Q8d    0.11   0.05   0.06   0.09   0.01   0.07   0.04   0.08   0.04
Q8e    0.06  -0.03   0.02   0.02  -0.04   0.04  -0.03   0.05   0.00
Q9a   -0.08  -0.09  -0.10  -0.09  -0.13  -0.09  -0.12  -0.08  -0.10
Q9b    0.24   0.15   0.14   0.18   0.14   0.18   0.15   0.20   0.10
Q9c   -0.04  -0.03  -0.03  -0.01  -0.06  -0.06  -0.06  -0.02  -0.04
Q9d    0.44   0.44   0.35   0.40   0.42   0.43   0.45   0.40   0.35

For the legend, see Table 1

Table 4: Number of Participants Leaving More Than Half of the Items Unanswered

     Customer satisfaction   EPSP   EDE   Involvement   Understanding
N                        0     25    21             0               2

3 Mokken scale analysis of customer satisfaction

Customer satisfaction was operationalised using the measurement instrument presented in

Chapter 5 (Table 1). In the first empirical study (Chapter 6), it was demonstrated

that the nine items constituted a scale according to the MH model. We hypothesised that the

nine items also constituted a scale according to the MH model in the second empirical study.

To test this hypothesis, Mokken scale analysis was done using MSPwin5.0 (Molenaar

& Sijtsma, 2000). First, the dimensionality of the item set was investigated using the

confirmatory strategy (Chapter 4). Second, the assumption of monotonicity was investigated

(Chapter 4). Third, the scale-scores statistics were computed (Chapter 4). Fourth, the

scalability of the item set within distinct customer segments, gender groups, and age groups

(Chapter 7) was investigated. Fifth, univariate analyses of variance were done to test whether

subgroups differed significantly with respect to the scale scores. For this purpose, proc GLM

(SAS STAT) was used. Sixth, in order to examine the effect of outliers on the results, the

analyses were repeated with the reduced dataset (i.e., the dataset without outliers; see Section 2).

Confirmatory Mokken scale analyses (item selection method = Test) demonstrated that

the nine items constituted a strong Mokken scale with a total-scale scalability coefficient H

equal to 0.67 and a reliability coefficient rho equal to 0.93 (Table 5). The lowest item

scalability coefficient Hi was equal to 0.57, which is well above the default lowerbound for Hi

used in exploratory analyses (i.e., lowerbound Hi = 0.3). The check for item monotonicity on

the basis of the default options in MSPwin5.0 (i.e., Minvi = 0.03 and Minsize = 122, which is

10 percent of the sample) did not reveal violations of the assumption of monotonicity. This

means that the ISRFs of all items increased across all rest-score groups. Thus, the MH model

fitted the data well.

The customer satisfaction scale-score distribution is presented in Figure 1. The

distribution was significantly skewed to the left (p < 0.001). Furthermore, the histogram

demonstrates peaks for the scale-scores 27, 31, and 36. The peak for the scale-score 27 was

mainly caused by participants who agreed with all items indicative of customer satisfaction,

and disagreed with all items counter-indicative of customer satisfaction (i.e., 66 percent of the

participants having scale-score 27 responded agree to the five items indicative of customer

satisfaction and disagree to the four items counter-indicative of customer satisfaction). The

peak for scale-score 31 was mainly caused by participants who agreed with all items

indicative of customer satisfaction, and strongly disagreed with all items counter-indicative of

customer satisfaction (i.e., 65 percent of the participants having scale-score 31 responded

agree to the five items indicative of customer satisfaction and totally disagree to the four

items counter-indicative of customer satisfaction). The peak for scale-score 36 was caused by

participants who strongly agreed with all items indicative of customer satisfaction, and

strongly disagreed with all items counter-indicative of customer satisfaction, because scale-

score 36 could only be achieved by responding totally agree to the five items indicative of

customer satisfaction and totally disagree to the four items counter-indicative of customer

satisfaction. It may be noted that the distribution of scale scores in the first empirical study did

not contain sharp peaks for the scale-scores 27, 31, and 36 (Chapter 6, Figure 3). This result is

further discussed in Section 5 of the present chapter.
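The arithmetic behind the three peaks can be made explicit with a small sketch. It assumes the scoring described in Chapter 7 (totally disagree = 0 up to totally agree = 4, with counter-indicative items reversed) and that the categories agree and disagree correspond to the scores 3 and 1; the function name `scale_score` is hypothetical.

```python
MAX_SCORE = 4
N_INDICATIVE = 5  # items indicative of customer satisfaction
N_COUNTER = 4     # items counter-indicative of customer satisfaction

# Assumed scores of the response categories mentioned in the text.
SCORES = {"totally disagree": 0, "disagree": 1, "agree": 3, "totally agree": 4}


def scale_score(indicative_response, counter_response):
    """Scale score when all indicative items receive one response and all
    counter-indicative items receive another (reversed before summing)."""
    indicative = SCORES[indicative_response]
    counter = MAX_SCORE - SCORES[counter_response]
    return N_INDICATIVE * indicative + N_COUNTER * counter


print(scale_score("agree", "disagree"))                  # 5*3 + 4*3
print(scale_score("agree", "totally disagree"))          # 5*3 + 4*4
print(scale_score("totally agree", "totally disagree"))  # 5*4 + 4*4
```

The three uniform response patterns yield the scale scores 27, 31, and 36, which are exactly the positions of the peaks in the histogram.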

Mokken scale analyses using the grouping variables customer segment (valued Top

Customers, Standard Customers, and Development Customers; see Chapter 7), gender (valued

female, male, and missing), and age group (valued 18 to 39 years, 40 years to 59 years, and 60

years onwards; see Chapter 7) demonstrated that the nine items also constituted a strong

Mokken scale (i.e., H > 0.5) in each subgroup (Table 5). The checks for item monotonicity in

the subgroups yielded a significant violation of the assumption of monotonicity in the

subgroup Top Customers. This violation was due to a decrease of the estimated ISRF for Q4c

>= 1 (I have regretted my choice for BANK; Table 1). Because the magnitude of the violation

was small (i.e., the proportion of responses Q4c >= 1 decreased from 1.00 in the middle rest-

score group to 0.97 in the highest rest-score group), we considered it unimportant, and we

concluded that the MH model fitted the data in subgroups well enough.
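The logic of the monotonicity check can be sketched as follows. This is a simplified illustration and not the algorithm implemented in MSPwin5.0: it compares the proportion of responses at or above an item step across rest-score groups and reports every decrease of at least Minvi, without the significance test that the program also performs. Treating a decrease exactly equal to Minvi as a violation is an assumption, as is the function name `monotonicity_violations`. In the example, the values 1.00 and 0.97 are those reported for Q4c >= 1; the value for the lowest rest-score group is hypothetical.

```python
def monotonicity_violations(proportions, minvi=0.03):
    """List decreases of at least `minvi` between rest-score groups.

    `proportions` holds, for one item step, the proportion of participants
    responding at or above that step, ordered from the lowest to the highest
    rest-score group. Returns tuples (low_group, high_group, decrease).
    """
    tolerance = 1e-12  # guards against floating-point noise at the threshold
    violations = []
    for low in range(len(proportions)):
        for high in range(low + 1, len(proportions)):
            decrease = proportions[low] - proportions[high]
            if decrease >= minvi - tolerance:
                violations.append((low, high, round(decrease, 4)))
    return violations


# Lowest group: hypothetical value; middle and highest group: as reported.
q4c_step1 = [0.95, 1.00, 0.97]
print(monotonicity_violations(q4c_step1))
```

With these inputs, only the pair of the middle and the highest rest-score group is flagged, with a decrease of 0.03, which illustrates why the violation was considered small.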

Univariate analyses of variance demonstrated that the customer segments and age

groups differed significantly with respect to the customer satisfaction scale-scores (Table 6).

Furthermore, the histograms (not shown here) demonstrated peaks for the scale-scores 27, 31,

and 36 in all subgroups investigated. Thus, the peaks cannot be attributed to particular

customer segments, gender groups, or age groups.

Figure 1: Distribution of customer satisfaction scores in the complete dataset (N = 1227, mean = 25.60, SD = 6.66, and skewness = -0.86)

The analyses of the reduced dataset yielded results similar to those of the analyses of the

complete dataset. Confirmatory Mokken scale analyses (item selection method = Test) yielded

a strong Mokken scale with a total-scale scalability coefficient H equal to 0.67 and a

reliability coefficient rho equal to 0.93 (Table 7). The check for item monotonicity on the

basis of the default options (i.e., Minvi = 0.03 and Minsize = 122, which is 10 percent of the

sample) did not reveal violations of the assumption of monotonicity. Thus, the MH model

fitted the data in the reduced dataset well.

Mokken scale analyses using the grouping variables customer segment, gender, and age

group yielded a strong Mokken scale (i.e., H > 0.5) in each subgroup (Table 7). The checks

for item monotonicity on the basis of the default options (i.e., Minvi = 0.03 and Minsize =

122, which is 10 percent of the sample) yielded a significant violation of the assumption of

monotonicity for item Q4c (Table 1) in the segment Top Customers, but the magnitude of the

violation was small. Therefore, we considered it unimportant, and we concluded that the MH

model fitted the data in the subgroups well enough.

Figure 2 shows the customer satisfaction scale-score distribution. The distribution was

significantly skewed to the left (p < 0.001), and there were peaks for scale-scores 27, 31, and

36 (66 percent of the participants having scale-score 27 responded agree to the five items

indicative of customer satisfaction and disagree to the four items counter-indicative of

satisfaction, 66 percent of the participants having scale-score 31 responded agree to the five

items indicative of customer satisfaction and totally disagree to the four items counter-

indicative of customer satisfaction, and all participants having scale-score 36 responded

totally agree to the five items indicative of customer satisfaction and totally disagree to the

four items counter-indicative of customer satisfaction). Similar distributions of scale scores

were found in the customer segments, gender groups, and age groups. Univariate analyses of

variance demonstrated that the customer segments and the age groups differed significantly

with respect to the scale scores (Table 8).

Figure 2: Distribution of customer satisfaction scores in the reduced dataset (N = 1184, mean = 25.69, SD = 6.61, and skewness = -0.85)

4 Measures for stylistic responding

Preliminary analyses

Measures of general midpoint responding and general extreme responding were constructed

on the basis of items with (a) low inter-item correlations, and (b) low correlations with

customer satisfaction (see Chapter 7). We constructed these measures on the basis of four

constructs (i.e., EPSP, EDE, involvement, and understanding), which were hypothesised to be

unrelated to customer satisfaction. This hypothesis was tested by means of CFA (e.g., Bollen,

1989; Oort, 1996). A factor model was specified using the nine items reflecting customer

satisfaction, the two items reflecting EPSP, the two items reflecting EDE, the four items

reflecting involvement, and the four items reflecting understanding (Figure 3). The fit of the

model was evaluated on the basis of the goodness of fit index (GFI), the normed fit index

(NFI), the non-normed fit index (NNFI), and the AMIs (Oort, 1996, p. 49; see also Chapter 6,

Section 4). Furthermore, the correlations between the factors were inspected.

CFA was done using proc calis (SAS STAT). The goodness of fit indices demonstrated

that the first factor model did not fit the data well (Table 10). Because indices having a value

of 0.9 or higher indicate an acceptable fit (Bollen, 1989, pp. 269-281), we required a value of

0.9 or higher for each index, and none of the three indices reached this value. Furthermore,

the AMIs (Table 9) demonstrated that two items reflecting

understanding (i.e., Q9b: I find it difficult to judge the quality of BANK, and Q9d: I know

exactly what I may expect from BANK; Table 1) were significantly biased (i.e., p < 0.001)

with respect to customer satisfaction (i.e., participants with a high value on customer

satisfaction were more inclined to respond positively to these understanding-items (note that

item Q9b was scored reversely; Section 2) than participants with a low value on customer

satisfaction, even when understanding is controlled for). Because it was required that the

items used for measuring general stylistic responding did not reflect customer satisfaction

(Chapter 7), we decided not to use these items for the measurement of stylistic responding.

Because the first factor model did not fit the data, a second factor model was tested. The

second factor model was specified using the same items for customer satisfaction, EPSP,

EDE, and involvement, and the two remaining items reflecting understanding (i.e., Q9a and

Q9c; Table 1). The second factor model fitted the data well (Table 10; the second factor

model), and none of the AMIs (not shown here) was significant. Furthermore, the absolute

correlations between the factors reflecting customer satisfaction, EPSP, EDE, involvement,

and understanding (Table 11) were considered sufficiently low for the purpose of the current

study. Therefore, we decided to use the items reflecting EPSP, EDE, and involvement, and


two items reflecting understanding (i.e., Q9a and Q9c; Table 1), for the construction of the

measures of general midpoint responding and general extreme responding.

Table 10: Goodness of Fit of the Factor Models for Customer Satisfaction, EPSP, EDE, Involvement, and Understanding

        First Factor Model                 Second Factor Model
        CD (N = 1227)   RD (N = 1184)      CD (N = 1227)   RD (N = 1184)
GFI     0.89            0.89               0.93            0.93
NFI     0.87            0.87               0.92            0.93
NNFI    0.86            0.87               0.92            0.93

CD is complete dataset; RD is reduced dataset.

Table 11: Inter-Factor Correlations in the Second Factor Model (Upper Triangle = Complete Dataset; Lower Triangle = Reduced Dataset)

                Satisfaction   EPSP    EDE     Involvement   Understanding
Satisfaction                   -0.02   0.07    0.07          -0.12
EPSP            -0.02                  0.48    0.06           0.11
EDE              0.08          0.49            0.09           0.05
Involvement      0.08          0.06    0.08                   0.28
Understanding   -0.12          0.12    0.04    0.28

General midpoint responding

General midpoint responding was defined as the participant’s proportional use of the middle

response category (i.e., corresponding to score 2), which may vary between zero (if zero

responses were in the middle response category) and one (if all responses were in the middle

response category). To test the hypothesis that satisfaction scores were not affected by the

midpoint response style, a measure of general midpoint responding was constructed. For this

purpose, the two items reflecting EPSP, the two items reflecting EDE, the four items

reflecting involvement, and the two remaining items reflecting understanding (i.e., Q9a and

Q9c; Table 1) were used. Missing values were excluded from the operationalisation, because

they do not provide information about general midpoint responding. The scores on the

measure of general midpoint responding ranged from zero to one, with a mean equal to 0.29

(Table 12). The reliability (i.e., coefficient alpha) of the scores was 0.59, which is

rather low but perhaps high enough for research purposes.

Midpoint responding to customer satisfaction items

To explore whether general midpoint responding was related to midpoint responding to

customer satisfaction items, a measure of midpoint responding to customer satisfaction items

was constructed. The measure of midpoint responding to customer satisfaction items was

constructed similar to the measure of general midpoint responding. However, for the present

measure the nine items reflecting customer satisfaction were used. The scores on the measure

of midpoint responding to customer satisfaction items ranged from zero to one, with a mean

of 0.17 (Table 12). The reliability (i.e., coefficient alpha) of the scores was 0.80.

General extreme responding

General extreme responding was defined as the participant’s proportional use of the extreme

response categories (i.e., corresponding to scores 0 and 4), which may vary between zero (if

zero responses were in the extreme response categories) and one (if all responses were in the

extreme response categories). To test the hypothesis that customer satisfaction scores were not

affected by the extreme response style, a measure of general extreme responding was

constructed. For this purpose, the same items were used that were also used for the

construction of the measure for general midpoint responding. Missing values were excluded

from the operationalisation, because they do not provide information about extreme

responding. The scores on the measure of general extreme responding ranged from zero to

0.83 in the complete dataset (0.80 in the reduced dataset), with a mean of 0.10 (Table 12).

The reliability (i.e., coefficient alpha) of the scores was 0.68.
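The two style measures defined above follow the same computation and can be sketched with one function. The sketch assumes that a missing value is represented by `None` and that a participant without any valid response receives no score; the names `style_proportion`, `MIDPOINT`, and `EXTREME` are hypothetical.

```python
MIDPOINT = {2}    # middle response category
EXTREME = {0, 4}  # totally disagree and totally agree


def style_proportion(responses, categories):
    """Proportion of valid responses that fall in the given categories.

    Missing values (None) are excluded from numerator and denominator.
    Returns None if the participant gave no valid response at all.
    """
    answered = [r for r in responses if r is not None]
    if not answered:
        return None
    hits = sum(1 for r in answered if r in categories)
    return hits / len(answered)


# Hypothetical responses of one participant to the ten style items
# (two EPSP, two EDE, four involvement, and Q9a and Q9c).
responses = [2, 2, 0, 4, 3, None, 1, 2, 3, 4]
print(style_proportion(responses, MIDPOINT))
print(style_proportion(responses, EXTREME))
```

Because missing values are dropped from the denominator, the score of a participant with missing responses is based on fewer than ten items, so that the scores are not restricted to multiples of 0.1.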

Extreme responding to customer satisfaction items

To explore whether general extreme responding was related to extreme responding to

customer satisfaction items, a measure of extreme responding to customer satisfaction items

was constructed. The measure of extreme responding to customer satisfaction items was

constructed similar to the measure of general extreme responding. However, for the present

measure the nine items reflecting customer satisfaction were used. The scores on the measure

of extreme responding to satisfaction items ranged from zero to one, with a mean of 0.26

(Table 12). The reliability (i.e., coefficient alpha) of the scores was 0.89.

Table 12: Descriptive Statistics of General Midpoint Responding (GMR), Midpoint Responding to Customer Satisfaction Items (MRCSI), General Extreme Responding (GER), and Extreme Responding to Customer Satisfaction Items (ERCSI)

Complete dataset (N = 1227)
         Min   Max    Median   Mean   SD     Skewness
GMR      0     1      0.30     0.29   0.20   0.72 *
MRCSI    0     1      0.11     0.17   0.23   1.47 *
GER      0     0.83   0        0.10   0.15   1.86 *
ERCSI    0     1      0.11     0.26   0.31   1.12 *

Reduced dataset (N = 1184)
         Min   Max    Median   Mean   SD     Skewness
GMR      0     1      0.30     0.29   0.20   0.72 *
MRCSI    0     1      0.11     0.17   0.23   1.49 *
GER      0     0.80   0        0.10   0.15   1.87 *
ERCSI    0     1      0.11     0.26   0.31   1.11 *

* = p < 0.001

5 Test of the hypotheses

Hypotheses 14 and 15 were tested in a similar way. First, the correlation was computed

between stylistic responding and customer satisfaction scores. This was done using proc corr

(SAS STAT). Second, to detect possible non-monotone relations between stylistic responding

and customer satisfaction scores, the stylistic responding scores were plotted against the

customer satisfaction scores. This was done using MS Excel. Third, the correlation was

computed between stylistic responding and stylistic responding to customer satisfaction items.

This was done using proc corr (SAS STAT).
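The first two steps can be sketched as follows. This is an illustration in Python rather than the SAS and Excel procedures used in the study: the product-moment correlation is computed directly from its definition, and the visual inspection of the plot is replaced by the standard deviation of the satisfaction scores per style-score group. Grouping the style scores by rounding to one decimal is an assumption, as are the function names and the toy data.

```python
import math
import statistics
from collections import defaultdict


def pearson(x, y):
    """Product-moment correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sd_by_group(style_scores, satisfaction_scores):
    """Standard deviation of satisfaction scores per style-score group.

    Assumption: groups are formed by rounding the style score to one decimal.
    Groups with a single participant are skipped.
    """
    groups = defaultdict(list)
    for style, satisfaction in zip(style_scores, satisfaction_scores):
        groups[round(style, 1)].append(satisfaction)
    return {
        group: statistics.stdev(values)
        for group, values in sorted(groups.items())
        if len(values) > 1
    }


# Toy data (hypothetical): zero correlation, but unequal spread per group.
style = [0.0, 0.0, 0.5, 0.5]
satisfaction = [10, 30, 19, 21]
print(pearson(style, satisfaction))
print(sd_by_group(style, satisfaction))
```

The toy data show why both steps are needed: a correlation of zero does not rule out that the spread of the satisfaction scores depends on the style score.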

Hypothesis 14

Hypothesis 14 was: the satisfaction scores are not affected by the midpoint response style.

The correlation between general midpoint responding and customer satisfaction was not

significant (Table 13). Furthermore, the plot of the customer satisfaction scores against the

general midpoint responding scores (Figure 4; complete dataset) did not demonstrate a

distinct non-monotone relation. There was a decrease in the standard deviation of the

customer satisfaction scores with increasing general midpoint responding scores, but the

magnitude of the decrease was small (Table 14) and we considered it unimportant. However,

the product-moment correlation between general midpoint responding and midpoint

responding to customer satisfaction items was significant (Table 13). Because customer

satisfaction was almost unrelated to the items underlying the measure of general midpoint

responding, it is plausible that the correlation was caused by the midpoint response style. This

implies that it is plausible that the customer satisfaction scores were affected by the midpoint

response style. Thus, hypothesis 14 was not supported.

Table 13: Product-Moment Correlations Between General Midpoint Responding (GMR), Midpoint Responding to Customer Satisfaction Items (MRCSI), and Customer Satisfaction (Satisfaction)

        Complete dataset (N = 1227)     Reduced dataset (N = 1184)
        MRCSI    Satisfaction           MRCSI    Satisfaction
GMR     0.14*    -0.03                  0.13*    -0.03

* = p < 0.001

Table 14: Standard Deviation (SD) of Customer Satisfaction in GMR-Groups (N = Group Size)

Complete dataset (N = 1227)
GMR    0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
N      139   220   212   235   176   113   68    33    13    11    7
SD     6.9   7.3   6.9   6.2   6.6   6.1   6.3   5.5   4.5   8.2   6.2

Reduced dataset (N = 1184)
GMR    0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
N      131   213   211   220   171   109   67    33    11    11    7
SD     6.9   7.3   6.9   6.1   6.4   6.1   6.3   5.5   4.1   8.2   6.2

Figure 4: Plot of general midpoint responding scores versus customer satisfaction scores in the complete dataset (N = 1227). The smallest circle represents one participant and the largest circle represents 35 participants.

Hypothesis 15

Hypothesis 15 was: the satisfaction scores are not affected by the extreme response style. The

correlation between general extreme responding and customer satisfaction was significant in

the reduced dataset (Table 15). Furthermore, the plot of the customer satisfaction scores

against the general extreme responding scores (Figure 5; complete dataset) showed

heteroscedasticity, which means that the variance of customer satisfaction scores differed

across subgroups with different general extreme responding scores. The distribution of

customer satisfaction scores in subgroups having high general extreme responding scores

appears bimodal. This means that high general extreme responding scores corresponded with

very high or very low customer satisfaction scores. In agreement with this result, the

standard deviation of customer satisfaction scores increased as the general extreme

responding score increased (Table 16). The product-moment correlation between general

extreme responding and extreme responding to customer satisfaction items was also

significant (Table 15). Because customer satisfaction was almost unrelated to the items

underlying the measure of general extreme responding, it is plausible that the correlation was

caused by the extreme response style. Thus, hypothesis 15 was not supported.
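The comparison of standard deviations across GER groups, as reported in Table 16, can be sketched as follows. This is an illustrative Python sketch only: the helper name and the toy data are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from collections import defaultdict
from statistics import stdev

def sd_by_group(style_scores, scale_scores):
    """Sample SD of the scale scores within each style-score group.

    Groups with fewer than two participants get None, because a sample
    standard deviation is undefined for them.
    """
    groups = defaultdict(list)
    for style, scale in zip(style_scores, scale_scores):
        groups[round(style, 1)].append(scale)
    return {
        key: (stdev(values) if len(values) > 1 else None)
        for key, values in sorted(groups.items())
    }

# Hypothetical toy data (not the thesis data).
ger = [0.0, 0.0, 0.0, 0.5, 0.5, 0.5, 0.9]
satisfaction = [26, 28, 27, 10, 34, 22, 36]

sd_table = sd_by_group(ger, satisfaction)
```

In this toy example the SD is larger in the group with the higher extreme responding score, which is the pattern that indicates heteroscedasticity.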

Table 15: Product-Moment Correlations Between General Extreme Responding (GER), Extreme Responding to Customer Satisfaction Items (ERCSI), and Customer Satisfaction (Satisfaction)

         Complete dataset (N = 1227)    Reduced dataset (N = 1184)
         ERCSI      Satisfaction        ERCSI      Satisfaction
GER      0.37**     0.04                0.38**     0.07*

* = p < 0.05; ** = p < 0.001

Table 16: Standard Deviation (SD) of Customer Satisfaction in GER-Groups (N = Group Size)

Complete dataset (N = 1227)
GER   0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
N     701   206   134   93    47    24    10    6     6     0     0
SD    5.5   6.8   7.1   8.5   9.0   10.2  14.8  12.9  12.5  -     -

Reduced dataset (N = 1184)
GER   0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
N     678   204   131   86    44    21    9     6     5     0     0
SD    5.5   6.8   7.1   8.4   9.1   9.8   13.1  12.9  13.9  -     -

Figure 5: Plot of general extreme responding scores versus customer satisfaction scores in the complete dataset (N = 1227). The smallest circle represents one participant, and the largest circle represents 120 participants.

6 Discussion

The second empirical study confirmed that the measurement instrument of customer

satisfaction constituted a scale according to the MH-model. Moreover, the results confirmed

that the scale also could be used in different subgroups. This result contributes to the validity

of the scale-score interpretations in terms of customer satisfaction with the company.

The tests of the hypotheses demonstrated that stylistic responding influenced the

customer satisfaction scale-scores. This means, for example, that the extreme scale-scores

were partly due to a high preference for extreme response categories in general. Because the

contamination of the scale scores due to stylistic responding was small (Tables 13 and 15), its

importance for the assessment of construct validity of the scale scores is also small. Still, it

limits the construct validity of the scale scores.

The distribution of scale scores showed remarkable peaks for the scale-scores 27, 31,

and 36. Each peak was mainly caused by a group of participants who responded to all nine

items in a similar way (see Section 3). For example, the peak for the scale-score 36 was

caused by participants who agreed strongly with all items indicative of customer satisfaction

and disagreed strongly with all items counter-indicative of customer satisfaction. Therefore,

we suspect that the peaks were caused by stylistic responding.
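The origin of the peak at scale-score 36 can be made concrete with a small worked example in Python. The sketch assumes that each of the nine items is coded 0 (disagree strongly) to 4 (agree strongly) and that counter-indicative items are reverse-coded before summing; this coding is inferred from 36 being the score of a uniformly most-satisfied response pattern and is not stated in this section. The function name and the item order are hypothetical.

```python
def scale_score(responses, counter_indicative):
    """Sum score after reverse-coding the counter-indicative items.

    responses: answers coded 0 (disagree strongly) to 4 (agree strongly).
    counter_indicative: indices of the items that are counter-indicative.
    """
    max_code = 4  # assumed highest response code
    return sum(
        (max_code - r) if i in counter_indicative else r
        for i, r in enumerate(responses)
    )

# Hypothetical item order: five indicative items, then four counter-indicative.
COUNTER = {5, 6, 7, 8}

# Agree strongly with all indicative items, disagree strongly with the rest.
most_satisfied = [4, 4, 4, 4, 4, 0, 0, 0, 0]
print(scale_score(most_satisfied, COUNTER))  # 36
```

Under these assumptions, any participant who uses only the extreme categories in the satisfied direction lands on exactly the same score, which is why such response patterns produce a sharp peak.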

Because the measurement instrument for customer satisfaction, the location of customer

satisfaction items in the questionnaire, the composition of the sample, and the mode of

administration were largely similar in the first and the second study, it is possible that stylistic

responding also influenced the scale scores in the first empirical study. However, the

distribution of scale scores in the first empirical study did not show such sharp peaks as the

distribution of scale scores in the second empirical study. Therefore we suspect that stylistic

responding was less prevalent in the first empirical study than in the second empirical study.

The following difference between the methods used in the first empirical study and the

second empirical study may explain the differences between the distributions of the scale

scores found in these studies. In the first empirical study the questionnaire was accompanied

by an extensive E-mail in which persons were invited to participate in the survey and in which

the purpose of the study was explained, whereas in the second empirical study the

questionnaire was accompanied by a succinct E-mail in which persons were invited to

participate in the survey but which did not explain the purpose of the study. The explanation

of the purpose of the study in the former E-mail may have affected the motivation of

participants to complete the questionnaire conscientiously. Therefore, we suspect that

satisficing (e.g., Krosnick, 1999, pp. 546-548) was less prevalent in the first empirical study

than in the second empirical study, and that for that reason stylistic responding also was less

prevalent in the first empirical study than in the second empirical study.

Summarising, the fit of the MH model supports the interpretation of the scale scores

from the second empirical study in terms of customer satisfaction with the company. Because

the content of the measurement instrument also supported that interpretation (Chapter 4),

there is much evidence for construct validity. Still, the tests of the hypotheses indicated that

stylistic responding contaminated the scale scores, and this limits the construct validity of the

scale scores. The contamination of the scale scores may be taken into account in any follow-

up research using the scale scores for customer satisfaction from the second empirical study.

It cannot be ruled out that the scale scores were also contaminated by stylistic

responding in the first empirical study, but there is evidence that contamination of the scale

scores by stylistic responding in the first empirical study was smaller than in the second

empirical study. Nevertheless, we suggest taking the possibility that the scale scores were

contaminated by stylistic responding into account in any follow-up research using the scale

scores for customer satisfaction from the first empirical study.

7 Conclusions

1 The content of the measurement instrument for customer satisfaction and the results

from the measurement analyses of the empirical studies supported the validity of the

scale-score interpretation in terms of overall satisfaction with the company. Moreover,

the results of the analyses demonstrated that the scale may be used in different customer

populations.

2 The items that were indicative of customer satisfaction and the other items that were

counter-indicative of the construct together constituted a unidimensional scale. This

result supports the conception of dissatisfaction as the opposite of satisfaction on a

bipolar continuum.

3 The quality of the measurement instrument may be improved by the substitution of the

items Q3a (At BANK I feel at home; Table 1) and Q4d (Last year I had some problems

with BANK; Table 1) with other items. This means that it should be investigated whether

the substitution of these items with two other items that reflect customer satisfaction

with a retail bank improves the validity of the measurements of customer satisfaction

with a retail bank.

4 The results of the second empirical study indicate that the scale scores partly reflected

stylistic responding. It is plausible that a part of the extreme satisfaction scores was

caused by a high general preference for extreme response categories. It is possible that

stylistic responding also influenced the scale scores in the first empirical study but

probably to a lesser extent than in the second empirical study.

5 There is strong evidence for the interpretation of the scale scores in terms of satisfaction

with the company in the first empirical study, and fair evidence for such an

interpretation in the second empirical study. Thus, the application of a measurement

instrument in one study may yield better scale scores than the application of the

instrument in another study (see also Messick, 1989, p. 81). This illustrates that

construct validity is a property of score interpretations and not of measurement

instruments, and that construct validity is always a matter of degree (see also Messick,

1989, p. 13).

Chapter 9

General discussion

1 The meaning of customer satisfaction

The purpose of this thesis was to unravel the meaning of customer satisfaction in the context

of retail banking. Customer satisfaction is a psychological construct. Psychological constructs

are organisational principles with respect to behaviour. This means that they are schemes

through which we perceive and interpret behaviours of persons. The ontological status of

customer satisfaction as organisational principle constitutes an important component of the

meaning of customer satisfaction.

The meaning of satisfaction is context-specific (Giese & Cote, 2000). Moreover,

satisfaction with a retail bank may be the absence of dissatisfaction for one customer, a

judgement of the performance of the bank for another customer, and an affect for a third

customer. To account for the different manifestations of satisfaction, we defined customer

satisfaction with a retail bank as the valenced response of the customer, directed towards the

retail bank, and evoked by the customer’s experiences with the bank throughout time. This

definition expresses that customer satisfaction with a retail bank encompasses affects and

cognitions that can be placed on a dimension that ranges from negative to positive. Because

the negative response expresses dissatisfaction and the positive response expresses

satisfaction, the definition also covers customer dissatisfaction with a retail bank. This

definition constitutes an important component of the theoretical meaning of customer

satisfaction in the context of retail banking.

Marketing studies (e.g., Anderson et al., 1994; Hennig-Thurau et al., 2002; Oliver,

1997; Verhoef, 2001; Yi, 1990) suggest that customer satisfaction is related to various other

psychological constructs, such as trust, quality, customer loyalty, commitment, word-of-

mouth, and image, and to customer profitability (CP). There is evidence that customer

satisfaction is preceded by quality and trust, and that customer satisfaction precedes customer

loyalty and CP. We hypothesised that the latter relations also applied to customer satisfaction

in the context of retail banking. The hypothesised relations between customer satisfaction and

trust, quality, customer loyalty, and CP constitute the implicit definition of customer

satisfaction in the context of retail banking. This definition also constitutes an important

component of the theoretical meaning of the construct.

The empirical meaning of customer satisfaction is the behaviours that are associated

with customer satisfaction. In the context of retail banking, these are manifestations of

performance evaluations, disconfirmation, expectations, emotions, and regret (see also Oliver,

1997, pp. 316-318, 343-344). These manifestations can be used for the measurement of

customer satisfaction. Because customer satisfaction has a large behavioural domain, we

developed a nine-item measurement instrument for customer satisfaction with a bank, which

covered different manifestations of customer satisfaction. Five items were indicative of

customer satisfaction and four items were counter-indicative of customer satisfaction. The

first empirical study into customer satisfaction with BANK demonstrated that the nine items

constituted a unidimensional scale. This result supported the theoretical notion that customer

satisfaction is the opposite of customer dissatisfaction on a bipolar dimension.

We found positive correlations between customer satisfaction and quality, and between

customer satisfaction and customer loyalty. These results supported our hypotheses

concerning these correlations, but three remarks are in order. First, the measurement of

quality on the basis of items reflecting judgements about products and services provided by

the company resulted in missing data problems and halo effects. We did not find a satisfactory

solution for these problems. Eventually, we re-defined quality as absence of problems, and we

measured quality by means of the total score on the recoded items regarding the experience of

problems with BANK in the preceding twelve months. We found that absence of problems

with BANK in the preceding twelve months was positively correlated with customer

satisfaction with BANK. Second, we found that the customer satisfaction scale-scores were

contaminated by quality. The scale scores were corrected by excluding one item from the

customer satisfaction scale when testing for the correlation between customer satisfaction and

quality. Third, we found that the customer satisfaction scale-scores were contaminated by

customer loyalty. The scale scores were corrected by excluding one item from the customer

satisfaction scale when testing for the correlation between customer satisfaction and customer

loyalty.

The positive effects of customer satisfaction on future CP after one year and future CP

after two years supported the hypothesis that customer satisfaction influences CP, and

confirmed the importance of customer satisfaction in the context of retail banking. We found

that current CP (i.e., CP at the time of the measurement of customer satisfaction) is an

indispensable variable in analyses of the relation between customer satisfaction and future CP.

However, we also found that the size of the effect of current CP on future CP decreased as the

time-lag between current CP and future CP increased. This implies that companies cannot rely

on current CP as a guarantee for future CP, and this warrants taking more than only current

CP into account when estimating customer lifetime value. Furthermore, we found that CP

follows a Pareto-like distribution in the context of retail banking, and that CP had to be

transformed before analysing the relation between customer satisfaction and CP. The latter

results may be useful for the development of methods for investigating the influence of

customer satisfaction on CP and estimating customer lifetime value.
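Why a heavily skewed variable such as CP needs transforming before a linear analysis can be illustrated with a short Python sketch. The transformation actually applied is not specified in this section; the signed logarithm below is an assumed, illustrative choice, picked because it also handles zero and negative values. The function name and the CP values are hypothetical.

```python
import math

def signed_log(value):
    """Compress large magnitudes while keeping the sign and the order."""
    return math.copysign(math.log1p(abs(value)), value)

# Hypothetical CP values with a long right tail (not the thesis data).
cp = [-120.0, 0.0, 15.0, 80.0, 2500.0, 40000.0]
transformed = [signed_log(v) for v in cp]
```

After such a transformation the few very large values no longer dominate the scale, while the rank order of the customers is unchanged.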

We also found a positive correlation between customer satisfaction and trust, which

supported our hypothesis concerning this correlation. It may be noted that the correlation

between the customer satisfaction scores and the trust scores was as large as the correlation

between the customer satisfaction scores and the ACSI scores. Customers were satisfied with

BANK when they trusted BANK, and dissatisfied with BANK when they did not trust

BANK. This was also an outcome of the pre-tests. There seems to be a large overlap between

the construct of customer satisfaction and the construct of trust in the context of retail

banking. Further research into the generalisability of this result is needed.

The second empirical study demonstrated that the customer satisfaction scores were

contaminated by stylistic responding of the participants. This means, for example, that the

extreme scale-scores were partly due to a high general preference for extreme response

categories. Because the contamination of the scale scores due to stylistic responding was

small, we considered its importance for the construct validity of the scale scores also small.

Still, it limits the construct validity of the scale scores. Therefore we suggested taking the

contamination of the scale scores into account when using these scores for any follow-up

research.

In all, the empirical studies yielded scale scores for customer satisfaction with BANK

and provided much evidence for the construct validity of the scale scores. Therefore we

concluded that the scale scores were rightly interpreted as customer satisfaction with BANK.

The scale scores constitute a special case of the empirical meaning of customer satisfaction.

2 The measurement of psychological constructs in marketing research

Another purpose of this thesis was to select a suitable methodology for the construction of a

measurement instrument for customer satisfaction and the validation of the customer

satisfaction scale-scores. Psychological constructs can be measured by means of

psychological tests (including measurement instruments for typical behaviour; see Chapter 1,

Section 4). For the measurement of psychological constructs in marketing research a test often

consists of a set of items that is administered in a survey. On the basis of a participant’s

responses to these items, his or her position on the scale for the property is inferred.

It is broadly acknowledged that validity of measurement is a key success factor for

satisfaction research and for marketing research in general. However, the practice of construct

validation in marketing research does not comply with the theory of validity as formulated by

Messick (1989). We demonstrated (Chapter 3, Section 6) that construct validity was

insufficiently investigated in important satisfaction studies in the marketing literature (see also

Giese & Cote, 2000; Peterson & Wilson, 1992). This hampers the usefulness of satisfaction

research for scientific purposes, such as testing of satisfaction theories, and for business

purposes, such as marketing strategy development.

Construct validity is the appropriateness of test-score interpretations in terms of the

construct of interest (e.g., Cronbach, 1971; Messick, 1989, pp. 13, 34). Churchill’s (1979)

perspective on construct validity, which is the leading perspective in marketing measurement,

conflicts with this conception of construct validity. Churchill’s (1979) perspective is flawed

with respect to the conception of construct validity as a property of a test, the criteria for the

assessment of construct validity, and the procedures for validation research. Construct validity

is a property of test-score interpretations, and not of tests. This means, for example, that the

application of a test may yield valid measurements of a construct in one instance, and less

valid measurements of a construct in another instance (see also Chapter 8, Section 7).

Furthermore, Churchill’s (1979) criteria for the assessment of construct validity, which are

nomological validity, divergent validity, and convergent validity, do not address the two

major threats to construct validity, which are construct underrepresentation and construct-

irrelevant variance (Messick, 1989, 1995). Consequently, Churchill’s (1979) procedures for

validation research, which are the MTMM framework and correlating a measure with a criterion

variable, do not suffice for the assessment of construct validity. Moreover, because the

methods applied in MTMM research are often similar, the agreement between two measures

of the same trait often provides evidence for reliability rather than validity (see also Anastasi,

1988, p. 158). The flaws in Churchill’s (1979) perspective on construct validity justify

adopting Messick’s (1989, 1995) perspective on construct validity and construct validation

research.

Because the deductive design (Schouwstra, 2000) is in agreement with Messick’s (1989,

1995) perspective on construct validity and validation research, we applied the deductive

design for the development of a test for customer satisfaction with BANK and the construct

validation of the test scores. The deductive design addresses test development and construct

validation for typical-behaviour properties (Table 1):

Table 1: Outline of Construct Validation Within the Deductive Design (Schouwstra, 2000, p. 60)

Scientific arguments    Construct representation                       Irrelevant variance
Rationales
  a. Formulation        Of what construct of interest is               And what not
  b. Translation        Of construct of interest into test content     And nothing else
  c. Modelling          How test score reflects construct
Empirical evidence      That test score reflects whole of construct    And nothing else

Psychometric theory provides useful guidelines for the definition of the construct of

interest, the translation of the construct of interest into test content, and the choice of a

measurement model for modelling the participant’s responses to the test. For example, it is

well-known that single items often yield inadequate measurements of constructs (e.g.,

Messick, 1989, pp. 14, 35), and this may explain why customer satisfaction has to be

measured by means of a multiple-item scale. The empirical research is directed at the

collection of empirical evidence regarding construct representation and irrelevant variance.

Schouwstra (2000, pp. 69-71) suggested formulating and testing hypotheses regarding

construct representation and absence of irrelevant variance. Two remarks concerning the

empirical research are in order. First, it is not feasible to formulate and test all possible

hypotheses regarding construct representation and absence of irrelevant variance. Therefore,

the formulation and testing of hypotheses has to be restricted to the most important

hypotheses, and which are the most important hypotheses remains to some extent arbitrary.

Second, we consider the requirement that the test scores reflect the whole construct and nothing

else too rigid. It is not feasible to exclude all possible irrelevant variance in the practice of

psychological measurement. Therefore, construct validity is always a matter of degree (see

also Messick, 1989, p. 13).

Because contamination of test scores cannot be entirely avoided in the practice of

psychological measurement, the construct validity of test scores is always limited to some

degree. Therefore, we suggest that future research investigate the degree to which test scores

are contaminated by other attributes, and take any contamination into account when the test

scores are used for follow-up research.

In all, the application of the deductive design yielded a scale for customer satisfaction

with BANK and much evidence for the construct validity of the scale scores. Therefore, we

consider the deductive design a useful framework for measurement instrument development

and construct validation in marketing research.

3 Suggestions for future research

First, we suggest further research into the influence of customer satisfaction on CP in retail

banking. We recommend research into the generalisability of the results of the present study

to other groups and companies within the financial services industry. Furthermore, we

recommend future research into the definition and measurement of CP, such as the inclusion

or exclusion of various costs, and the accumulation of profits over longer time periods than

one year.

The second suggestion for future research concerns executing context-specific customer

satisfaction studies. We agree with Giese and Cote (2000) that the meaning of customer

satisfaction is context-specific, and that definitions and measures of customer satisfaction also

should be context-specific. We also expect that the antecedents of customer satisfaction are

context-specific. Context-specific customer satisfaction studies may contribute to the further

development of general theory about customer satisfaction.

The third suggestion for future research concerns the development of context-specific

definitions of quality and corresponding measurement procedures. We had much difficulty

with the measurement of quality in the present study. Moreover, different inquiries may

require different definitions and operationalisations of quality. Proper operationalisations of

quality are important for investigating the influence of quality on customer satisfaction, and

such investigations are important for making customer satisfaction actionable for companies.

Fourth, we suggest the deductive design (Schouwstra, 2000) for the measurement of

psychological constructs in marketing research. The marketing literature uses many

psychological constructs, and there appears to be much redundancy in the collection of

constructs. Marketing research may disentangle these constructs, and for that purpose it has to

define and measure them properly. Because Messick’s (1989, 1995) perspective on construct

validity can be put into action by the deductive design, we suggest the deductive design for

the measurement and the validation of measurements of psychological constructs in marketing

research.

4 Concluding remarks

This thesis explored the meaning of customer satisfaction in retail banking, and the usefulness

of psychometric methods for test development and construct validation. It was demonstrated

that, in the context of retail banking, customer satisfaction is manifested in performance

evaluations, disconfirmation, expectations, emotions, and regret. This is a useful result for the

further development of satisfaction theory and for customer satisfaction management in the

financial services industry. It explains why customer satisfaction is not exclusively driven by

technical quality of products, services, and processes. Therefore a bank’s customer

satisfaction management strategy may start with managing technical quality, and having

accomplished that, it may proceed with managing functional quality, complaints handling,

and corporate communication. Furthermore, the thesis provided strong evidence for the

influence of customer satisfaction on CP. This is a useful result for the further development of

satisfaction theory and eventually for marketing strategy development in the industry of retail

banking. The influence of customer satisfaction on CP warrants adopting customer

satisfaction as a strategic goal of retail banks, all the more so because the influence of current

CP on future CP decreases when the time lag increases. The thesis also demonstrated that the

application of psychometric methods for the measurement of customer satisfaction yielded

scale scores that can be rightly interpreted as customer satisfaction scores. This is a useful

result for the methodology of marketing research and eventually for the development and

validation of marketing theories.

References

Anastasi, A. (1986). Evolving concepts of test validation. Annual Review of Psychology, 37,

Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan Publishing

Company.

Angoff, W.H. (1988). Validity: An evolving concept. In H. Wainer & H.I. Braun (Eds.), Test

validity (pp. 19-32). Hillsdale, NJ: Lawrence Erlbaum Associates.

Anderson, E.W., Fornell, C., & Lehmann, D.R. (1994). Customer satisfaction, market share

and profitability: Findings from Sweden. Journal of Marketing, 58, 53-66.

Anderson, E.W., & Mittal, V. (2000). Strengthening the satisfaction-profit chain. Journal of

Service Research, 3, 107-123.

Anderson, E.W., Fornell, C., & Mazvancheryl, S.K. (2004). Customer satisfaction and

shareholder value. Journal of Marketing, 68, 172-185.

Baumgartner, H., & Steenkamp, J.B.E.M. (2001). Response styles in marketing research: A

cross-national investigation. Journal of Marketing Research, 38, 143-156.

Baumgartner, H., & Steenkamp, J.B.E.M. (2006). Response biases in marketing research. In

R. Grover & M. Vriens (Eds.), The handbook of marketing research: Uses, misuses and

future advances (pp. 95-109). Thousand Oaks: Sage Publications.

Bearden, W.O., Netemeyer, R.G., & Mobley, M.F. (1993). Handbook of marketing scales:

Multi-item measures for marketing and consumer behavior research. Newbury Park,

CA: Sage Publications.

Belson, W.A. (1981). The design and understanding of survey questions. Aldershot: Gower

Publishing Company Limited.

Belson, W.A. (1986). Validity in survey research. Aldershot: Gower Publishing Company

Limited.

Berens, G.A.J.M. (2004). Corporate branding: The development of corporate associations

and their influence on stakeholder reactions. Doctoral dissertation, Erasmus University,

Rotterdam.

Bernaards, C.A., & Sijtsma, K. (2000). Influence of imputation and EM methods on factor

analysis when item nonresponse in questionnaire data is nonignorable. Multivariate

Behavioral Research, 34, 277-313.

Bloemer, J.M.M. (1993). Loyaliteit en tevredenheid: Een studie naar de relatie tussen

merktrouw en consumententevredenheid. Doctoral dissertation, University of Maastricht,

Maastricht.

Bloemer, J.M.M., & Poiesz, T.B.C. (1989). The illusion of consumer satisfaction. Journal of

Consumer Satisfaction, Dissatisfaction and Complaining Behavior, 2, 43-48.

Bloemer, J.M.M., & Kasper, H.D.P. (1995). The complex relationship between consumer

satisfaction and brand loyalty. Journal of Economic Psychology, 16, 311-329.

Bollen, K.A. (1989). Structural equations with latent variables. New York: Wiley.

Borsboom, D., Mellenbergh, G.J., & Van Heerden, J. (2003). The theoretical status of latent

variables. Psychological Review, 110, 203-219.

Borsboom, D., Mellenbergh, G.J., & Van Heerden, J. (2004). The concept of validity.

Psychological Review, 111, 1061-1071.

Borsboom, D. (2005). Measuring the mind. New York: Cambridge University Press.

Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71, 425-440.

Bouwmeester, S., & Sijtsma, K. (2006). Constructing a transitive reasoning test for 6-to-13

year old children. European Journal of Psychological Assessment, 22, 225-232.

Bradburn, N.M. (1983). Response effects. In P.H. Rossi, J.D. Wright, & A.B. Anderson

(Eds.), Handbook of survey research (pp. 289-328). New York: Academic Press Inc.

Bronner, F., & Kuijlen, T. (2007). The live or digital interviewer: A comparison between

CASI, CAPI, and CATI with respect to differences in response behaviour. International

Journal of Market Research, 49, 167-190.

Buttle, F. (1996). SERVQUAL: Review, critique, research agenda. European Journal of

Marketing, 30, 8-32.

Byrne, B.M. (1989). A primer of LISREL: Basic applications and programming for

confirmatory factor analytic models. New York: Springer-Verlag.

Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the

multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.

Campbell, D., & Frei, F. (2004). The persistence of customer profitability: Empirical evidence

and implications from a financial services firm. Journal of Service Research, 7, 107-123.

Carnap, R. (1950). Logical foundations of probability. Chicago: University of Chicago Press.

Carnap, R. (1956). The methodological character of theoretical concepts. In H. Feigl & M.

Scriven (Eds.), Minnesota studies in the philosophy of science, Vol I. Minneapolis:

University of Minnesota Press.

Caruana, A. (2002). Service loyalty: The effects of service quality and the mediating role of

customer satisfaction. European Journal of Marketing, 36, 811-828.

Churchill, G.A. (1979). A paradigm for developing better measures of marketing constructs.

Journal of Marketing Research, 16, 64-73.

Churchill, G.A., & Surprenant, C. (1982). An investigation into the determinants of customer

satisfaction. Journal of Marketing Research, 19, 491-504.

Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the

behavioral sciences (2nd ed.). Hillsdale: Lawrence Erlbaum Associates.

Cook, T.D., & Campbell, D.T. (1979). Quasi-experimentation: Design and analysis issues for

field settings. Chicago: Rand McNally.

Coombs, C.H. (1964). A theory of data. New York: John Wiley and Sons.

Cooper, R., & Kaplan, R.S. (1991). The design of cost management systems: Text, cases, and

readings. Englewood Cliffs, NJ: Prentice Hall.

Coulthard, L.J.M. (2004). Measuring service quality: A review and critique of research using

SERVQUAL. International Journal of Market Research, 46, 479-497.

Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16,

297-335.

Cronbach, L.J., & Meehl, P.E. (1955). Construct validity in psychological tests. Psychological

Bulletin, 52, 281-302.

Cronbach, L.J. (1971). Test validation. In R.L. Thorndike (Ed.), Educational measurement

(pp. 443-507). Washington, DC: American Council on Education.

Cronbach, L.J. (1988). Five perspectives on the validity argument. In H. Wainer & H.I. Braun

(Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Lawrence Erlbaum Associates.

Cronbach, L.J. (1989). Construct validation after thirty years. In R. Linn (Ed.), Intelligence:

Measurement, theory, and public policy (pp. 147-171). Urbana, IL: University of Illinois

Press.

Cronin, J.J., & Taylor, S.A. (1992). Measuring service quality: A reexamination and

extension. Journal of Marketing, 56, 55-68.

Cronin, J.J., & Taylor, S.A. (1994). SERVPERF versus SERVQUAL: Reconciling

performance-based and perceptions minus expectations measurement of service quality.

Journal of Marketing, 58, 125-131.

De Ruyter, K., Bloemer, J., & Peeters, P. (1997). Merging service quality and service

satisfaction: An empirical test of an integrative model. Journal of Economic Psychology,

18, 387-406.

Dick, A., & Basu, K. (1994). Customer loyalty: Toward an integrated conceptual framework.

Journal of the Academy of Marketing Science, 22, 99-113.

Dillman, D.A., Tortora, R.S., & Bowker, D. (1998). Principles for constructing web surveys.

SESRC Technical Report 98-50. Washington State University.

Dillman, D.A., & Bowker, D.K. (2001). The web questionnaire challenge to survey

methodologists. In U.D. Reips & M. Bosnjak (Eds.), Dimensions of internet science (pp.

159-178). Lengerich: Pabst Science Publishers.

Donkers, B., Verhoef, P.C., & De Jong, M.G. (2007). Modeling CLV: A test of competing

models in the insurance industry. Quantitative Marketing and Economics, 5, 163-190.

Embretson, S.E., & Reise, S.P. (2000). Item response theory for psychologists. Mahwah, NJ:

Lawrence Erlbaum Associates.

Fabrigar, L.R., Krosnick, J.A., & MacDougall, B.L. (2005). Attitude measurement:

Techniques for measuring the unobservable. In T.C. Brock & M.C. Green (Eds.),

Persuasion: Psychological insights and perspectives (pp. 17-40). Thousand Oaks, CA:

Fornell, C., & Larcker, D.F. (1981). Evaluating structural equation models with unobservable

variables and measurement error. Journal of Marketing Research, 18, 39-50.

Fornell, C., & Wernerfelt, B. (1987). Defensive marketing strategy by customer complaint

management: A theoretical analysis. Journal of Marketing Research, 24, 337-346.

Fornell, C., & Wernerfelt, B. (1988). A model for customer complaint management.

Marketing Science, 7, 271-286.

Fornell, C. (1992). A national customer satisfaction barometer: The Swedish experience.

Fornell, C., Johnson, M.D., Anderson, E.W., Cha, J., & Bryant, B.E. (1996). The American

customer satisfaction index: Nature, purpose and findings. Journal of Marketing, 60, 7-

Frege, G. (1892). On sense and reference. In P. Geach & M. Black (Eds.), (1952).

Translations of the philosophical writings of Gottlob Frege. Oxford, England: Blackwell.

Friman, M. (2004). The structure of affective reactions to critical incidents. Journal of

Economic Psychology, 25, 331-353.

Gardner, H. (1999). Intelligence reframed: Multiple intelligences for the 21st century. New

York: Basic Books.

Garvin, D.A. (1983). Quality on the line. Harvard Business Review, 61, 65-73.

Giese, J.L., & Cote, J.A. (2000). Defining customer satisfaction. Academy of Marketing

Science Review. www.amsreview.org/articles/giese01-2000.pdf.

Goedee, J., Reijnders, W., & Van Thiel, D. (2008). Bankieren in 2020: De impact van

consumentenvertrouwen en technologische ontwikkelingen. Amsterdam: Pearson

Education Benelux.

Gorsuch, R.L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Greenleaf, E.A. (1992a). Improving rating scale measures by detecting and correcting bias

components in some response styles. Journal of Marketing Research, 29, 176-188.

Greenleaf, E.A. (1992b). Measuring extreme response style. Public Opinion Quarterly, 56,

328-351.

Gremler, D.D., & Brown, S.W. (1996). Service loyalty, its nature, importance and

implications. In B. Edvardsson, S.W. Brown, R. Johnston, & E.E. Scheuing (Eds.),

Advancing service quality: A global perspective (pp. 171-180). International Service

Quality Association.

Gremler, D.D., & Brown, S.W. (1999). The loyalty ripple effect: Appreciating the full value

of customers. International Journal of Service Industry Management, 10, 271-299.

Grönroos, C. (1984). A service quality model and its marketing implications. European

Grönroos, C. (1990). Service management and marketing: Managing the moments of truth in

service competition. Lexington, MA: Lexington Books.

Groves, R.M. (1989). Survey errors and survey costs. New York: Wiley.

Gruca, T.S., & Rego, L.L. (2005). Customer satisfaction, cash flow and shareholder value.

Gustafsson, A., Johnson, M.D., & Roos, I. (2005). The effects of customer satisfaction,

relationship commitment dimensions, and triggers on customer retention. Journal of

Marketing, 69, 210-218.

Guttman, L. (1954). An outline of some new methodology for social research. Public Opinion

Quarterly, 18, 395-404.

Hausknecht, D.R. (1990). Measurement scales in consumer satisfaction/dissatisfaction.

Journal of Consumer Satisfaction, Dissatisfaction and Complaining Behavior, 3, 1-11.

Hays, W.L. (1988). Statistics (4th ed.). New York: Holt, Rinehart and Winston, Inc.

Heiser, W.J. (2006). Measurement without copper instruments and experiment without

complete control. Psychometrika, 71, 457-461.

Hennig-Thurau, T., Gwinner, K.P., & Gremler, D.D. (2002). Understanding relationship

marketing outcomes: An integration of relational benefits and relationship quality.

Journal of Service Research, 4, 230-247.

Herzberg, F., Mausner, B., & Snyderman, B.B. (1959). The motivation to work. New York:

Wiley.

Howard, J.A., & Sheth, J.N. (1969). The theory of buyer behavior. New York: John Wiley

and Sons.

Homburg, C., Koschate, N., & Hoyer, W.D. (2005). Do satisfied customers really pay more?

A study of the relationship between customer satisfaction and willingness to pay.

Hox, J.J. (1997). From theoretical concept to survey question. In L. Lyberg, P. Biemer, M.

Collins, E. De Leeuw, C. Dippo, N. Schwarz, & D. Trewin (Eds.), Survey measurement

and process quality (pp. 47-70). New York: Wiley.

Hox, J.J. (1998). Er is nieuws onder de zon: Nieuwe oplossingen voor oude problemen.

Kwantitatieve Methoden, 19, 95-118.

Ittner, C.D., & Larcker, D.F. (1998). Are nonfinancial measures leading indicators of

financial performance? An analysis of customer satisfaction. Journal of Accounting

Research, 36, 1-35.

Jack, A.B. (1967). Sampling from a Pareto distribution. Metroeconomica, 19, 216-223.

Jackson, D.N., & Messick, S. (1958). Content and style in personality assessment.

Psychological Bulletin, 55, 243-252.

Jackson, D.N. (1971). The dynamics of structured personality tests: 1971. Psychological

Review, 78, 229-248.

Jackson, D.N. (1973). Structural personality assessment. In B.B. Wolman (Ed.), Handbook of

general psychology (pp. 775-792). NJ: Prentice Hall.

Jacoby, J. (1976). Consumer research: Telling it like it is. In B.B. Anderson (Ed.), Advances

in Consumer Research, 3, 1-11.

Jansen, B.R.J., & Van der Maas, H. (1997). Statistical tests of the rule assessment

methodology by latent class analysis. Developmental Review, 17, 321-357.

Johnson, M.D., Gustafsson, A., Andreassen, T.W., Lervik, L., & Cha, J. (2001). The

evolution and future of national customer satisfaction index models. Journal of

Economic Psychology, 22, 217-245.

Johnston, R. (1995). The determinants of service quality: Satisfiers and dissatisfiers.

International Journal of Service Industry Management, 6, 53-71.

Kackar, R.N. (1989). Taguchi’s quality philosophy: Analysis and commentary. In K. Dehnad

(Ed.), Quality control, robust design, and the Taguchi method (pp. 3-19). Pacific Grove:

Wadsworth and Brooks/Cole.

Kane, M. (2006). In praise of pluralism. A comment on Borsboom. Psychometrika, 71, 441-

Kelley, T.L. (1927). Interpretation of educational measurements. New York: World Book

Company.

Knowles, E.S., & Nathan, K.T. (1997). Acquiescent responding in self reports: Cognitive

style or social concern. Journal of Research in Personality, 31, 293-301.

Krosnick, J.A. (1991). Response strategies for coping with the cognitive demands of attitude

measures in surveys. Applied Cognitive Psychology, 5, 213-236.

Krosnick, J.A., & Fabrigar, L.R. (1997). Designing rating scales for effective measurement in

surveys. In L. Lyberg, P. Biemer, M. Collins, E. De Leeuw, C. Dippo, N. Schwarz, & D.

Trewin (Eds.), Survey measurement and process quality (pp. 141-164). New York:

Wiley.

Krosnick, J.A. (1999). Survey research. Annual Review of Psychology, 50, 537-567.

Lehmann, D.R. (1999). Consumer behaviour and Y2K. Journal of Marketing, 63, 14-18.

Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140,

44-53.

Little, R.J.A., & Rubin, D.B. (2002). Statistical analysis with missing data (2nd ed.). New

York: Wiley.

Loevinger, J. (1948). The technique of homogeneous tests compared with some aspects of

‘scale analysis’ and factor analysis. Psychological Bulletin, 45, 507-530.

Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading:

Addison Wesley.

Luo, X., & Homburg, C. (2007). Neglected outcomes of customer satisfaction. Journal of

Marketing, 71, 133-149.

Mahalanobis, P.C. (1936). On the generalized distance in statistics. Proceedings of the

National Institute of Science of India, 12, 49-55.

Mano, H., & Oliver, R.L. (1993). Assessing the dimensionality and structure of the

consumption experience: Evaluation, feeling, and satisfaction. Journal of Consumer

Research, 20, 451-466.

Maxwell, S.E., & Delaney, H.D. (1990). Designing experiments and analyzing data: A model

comparison perspective. Belmont, CA: Wadsworth Publishing Company.

Medlin, C.J., & Quester, P.G. (2002). Inter-firm trust: Two theoretical dimensions versus a

global measure. Paper presented at the IMP conference in Perth, Australia.

www.impgroup.org/uploads/papers/4247.pdf.

Mellenbergh, G.J. (1985). Vraagonzuiverheid: Detectie, definitie en onderzoek. Nederlands

Tijdschrift voor de Psychologie, 40, 425-435.

Messick, S. (1988). The once and future issues of validity: Assessing the meaning and

consequences of measurement. In H. Wainer & H.I. Braun (Eds.), Test validity (pp. 33-

45). Hillsdale, NJ: Lawrence Erlbaum Associates.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 13-

103). New York: Macmillan Publishing Co.

Messick, S. (1991). Psychology and methodology of response styles. In R.E. Snow & D.E.

Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach

(pp. 161-200). Hillsdale, NJ: Lawrence Erlbaum Associates.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from

persons’ responses and performances as scientific inquiry into score meaning. American

Psychologist, 50, 741-749.

Mittal, V., & Kamakura, W.A. (2001). Satisfaction, repurchase intent, and repurchase

behavior: Investigating the moderating effects of customer characteristics. Journal of

Marketing Research, 38, 131-142.

Molenaar, I.W. (1995). Some background for item response theory and the Rasch model. In

I.W. Molenaar & G.H. Fischer (Eds.), Rasch models: Foundations, recent developments

and applications (pp. 3-14). New York: Springer-Verlag.

Molenaar, I.W., & Sijtsma, K. (2000). MSP5 for Windows: User’s manual. Groningen:

ProGAMMA.

Mokken, R.J. (1971). A theory and procedure of scale analysis. The Hague: Mouton; Berlin:

De Gruyter.

Morgan, R.M., & Hunt, S.D. (1994). The commitment-trust theory of relationship marketing.

Mulhern, F.J. (1999). Customer profitability analysis: Measurement, concentration, and

research directions. Journal of Interactive Marketing, 13, 25-40.

Murphy, K.R., & Davidshofer, C.O. (1991). Psychological testing: Principles and

applications (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.

Newman, K. (2001). Interrogating SERVQUAL: A critical assessment of service quality

measurement in a high street retail bank. International Journal of Bank Marketing, 19,

126-139.

Niraj, R., Gupta, M., & Narasimhan, C. (2001). Customer profitability in a supply chain.

Nunnally, J.C. (1978). Psychometric theory (2nd ed.). New York: McGraw Hill.

Oliver, R.L. (1980). A cognitive model of the antecedents and consequences of satisfaction

decisions. Journal of Marketing Research, 17, 460-469.

Oliver, R.L., & DeSarbo, W.S. (1988). Response determinants in satisfaction judgments.

Journal of Consumer Research, 14, 495-507.

Oliver, R.L., & Swan, J.E. (1989). Consumer perceptions of interpersonal equity and

satisfaction in transactions: A field survey approach. Journal of Marketing, 53, 21-35.

Oliver, R.L. (1993). Cognitive, affective, and attribute bases of the satisfaction response.

Journal of Consumer Research, 20, 418-430.

Oliver, R.L. (1997). Satisfaction: A behavioral perspective on the consumer. New York:

McGraw Hill.

Oliver, R.L. (1999). Whence consumer loyalty? Journal of Marketing, 63, 33-44.

Oliver, R.L., & Burke, R.R. (1999). Expectation processes in satisfaction formation. Journal

of Service Research, 1, 196-214.

Oort, F.J. (1996). Using restricted factor analysis in test construction. Doctoral dissertation,

University of Amsterdam, Amsterdam.

Oosterveld, P. (1996). Questionnaire design methods. Doctoral dissertation, University of

Amsterdam, Amsterdam.

Parasuraman, A., Zeithaml, V.A., & Berry, L.L. (1985). A conceptual model of service

quality and its implications for future research. Journal of Marketing, 49, 41-50.

Parasuraman, A., Zeithaml, V.A., & Berry, L.L. (1988). SERVQUAL: A multiple-item scale

for measuring consumer perceptions of service quality. Journal of Retailing, 64, 12-40.

Parasuraman, A., Zeithaml, V.A., & Berry, L.L. (1994). Reassessment of expectations as a

comparison standard in measuring service quality: Implications for future research.

Paulhus, D.L. (1991). Measurement and control of response bias. In J.P. Robinson, P.R.

Shaver, & L.S. Wrightsman (Eds.), Measures of personality and social psychological

attitudes (pp. 17-59). San Diego, CA: Academic Press Inc.

Peter, J.P. (1981). Construct validity: A review of basic issues and marketing practices.

Journal of Marketing Research, 18, 133-145.

Peterson, R.A., & Wilson, W.R. (1992). Measuring customer satisfaction: Fact and artefact.

Journal of the Academy of Marketing Science, 20, 61-71.

Pfeifer, P.E., Haskins, M.E., & Conroy, R.M. (2005). Customer lifetime value, customer

profitability, and the treatment of acquisition spending. Journal of Managerial Issues,

17, 11-25.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests.

Copenhagen: Danish Institute for Educational Research.

Reichheld, F.F., & Sasser, W.E. (1990). Zero defections: Quality comes to service. Harvard

Business Review, 68, 105-111.

Reichheld, F.F. (2006). The ultimate question: Driving good profits and true growth.

Cambridge: Harvard Business School Press.

Rotter, J. (1967). A new scale for the measurement of interpersonal trust. Journal of

Personality, 35, 651-665.

Rugg, D. (1941). Experiments in wording questions: II. Public Opinion Quarterly, 5, 91-92.

Russell, J.A., & Carroll, J.M. (1999a). On the bipolarity of positive and negative affect.

Psychological Bulletin, 125, 3-30.

Russell, J.A., & Carroll, J.M. (1999b). The phoenix of bipolarity: Reply to Watson and

Tellegen (1999). Psychological Bulletin, 125, 611-617.

Rust, R.T., & Zahorik, A.J. (1993). Customer satisfaction, customer retention and market

share. Journal of Retailing, 69, 193-215.

Saris, W.E., Van Wijk, T., & Scherpenzeel, A. (1998). Validity and reliability of subjective

social indicators: The effect of different measures of association. Social Indicators

Research, 45, 173-199.

Sartori, G. (1984). Guidelines for concept analysis. In G. Sartori (Ed.), Social science

concepts: A systematic analysis (pp. 15-85). Beverly Hills, CA: Sage.

Schafer, J.L., & Graham, J.W. (2002). Missing data: Our view of the state of the art.

Psychological Methods, 7, 147-177.

Scherpenzeel, A.C. (1995). A question of quality: Evaluating survey questions by multitrait-

multimethod studies. Doctoral dissertation, University of Amsterdam, Amsterdam.

Schouwstra, S.J. (2000). On testing plausible threats to construct validity. Doctoral

dissertation, University of Amsterdam, Amsterdam.

Schuman, H., & Presser, S. (1981). Questions and answers in attitude surveys: Experiments on

question form, wording, and context. New York: Academic Press Inc.

Sheatsley, P.B. (1983). Questionnaire construction and item writing. In P.H. Rossi, J.D.

Wright, & A.B. Anderson (Eds.), Handbook of survey research (pp. 195-230). New

York: Academic Press Inc.

Sijtsma, K., & Molenaar, I.W. (2002). Introduction to nonparametric item response theory.

Thousand Oaks: Sage.

Sijtsma, K., & Van der Ark, L.A. (2003). Investigation and treatment of missing item scores

in test and questionnaire data. Multivariate Behavioral Research, 38, 503-528.

Sijtsma, K. (2006). Psychometrics in psychological research: Role model or partner in

science? Psychometrika, 71, 451-455.

Sijtsma, K., Emons, W.H.M., Bouwmeester, S., Nyklicek, I., & Roorda, L.D. (2008).

Nonparametric IRT analysis of quality-of-life scales and its application to the World

Health Organization Quality-of-Life Scale (WHOQOL-Bref). Quality of Life Research, 17,

275-290.

Singh, J., & Sirdeshmukh, D. (2000). Agency and trust mechanisms in consumer satisfaction

and loyalty judgments. Journal of the Academy of Marketing Science, 28, 150-167.

Soliman, H.M. (1970). Motivation-hygiene theory of job attitudes: An empirical investigation

and an attempt to reconcile both the one- and the two-factor theories of job attitudes.

Journal of Applied Psychology, 54, 452-461.

Stouthard, M.E.A., Mellenbergh, G.J., & Hoogstraten, J. (1993). Assessment of dental

anxiety: A facet approach. Anxiety, Stress, and Coping, 6, 89-105.

Sudman, S., & Bradburn, N.M. (1982). Asking questions: A practical guide to questionnaire

design. San Francisco: Jossey-Bass.

Tabachnick, B.G., & Fidell, L.S. (2007). Using multivariate statistics (5th ed.). Boston:

Pearson Education Inc.

Terpstra, M.J., & Van Gastel, W. (2004). Inventory of customer satisfaction surveys.

Unpublished report, ING Group, Amsterdam.

Terpstra, M.J. (2005). Customer satisfaction, customer loyalty and customer profitability.

Terpstra, M.J. (2006a). Customer satisfaction, customer loyalty, and recommendation

intentions. Unpublished report, ING Group, Amsterdam.

Terpstra, M.J. (2006b). Business facts for ING Retail Netherlands. Unpublished report, ING

Group, Amsterdam.

Terpstra, M.J. (2008). A model for developing customer satisfaction business cases.

Thomson, G. (1961). The inspiration of science. London: Oxford University Press.

Thorndike, E.L. (1920). A constant error in psychological ratings. Journal of Applied

Psychology, 4, 25-29.

Torgerson, W.S. (1958). Theory and methods of scaling. New York: John Wiley and Sons.

Tse, D.K., & Wilton, P.C. (1988). Models of consumer satisfaction: An extension. Journal of

Marketing Research, 25, 204-212.

Van der Ark, L.A. (2005). Stochastic ordering of the latent trait by the sum score under

various polytomous IRT models. Psychometrika, 70, 283-304.

Van Dolen, W., Lemmink, J., Mattsson, J., & Rhoen, I. (2001). Affective consumer responses

in service encounters: The emotional content in narratives of critical incidents. Journal

of Economic Psychology, 22, 359-376.

Van Herk, H. (2000). Equivalence in a cross-national context: Methodological & empirical

issues in marketing research. Doctoral dissertation, University of Tilburg, Tilburg.

Van Montfort, K., Masurel, E., & Van Rijn, I. (2000). Service satisfaction: An empirical

analysis of consumer satisfaction in financial services. The Service Industries Journal,

20, 80-94.

Van Ginkel, J.R. (2007). Multiple imputation for incomplete test, questionnaire, and survey

data. Doctoral dissertation, University of Tilburg, Tilburg.

Van Ginkel, J.R., Van der Ark, L.A., & Sijtsma, K. (2007). Multiple imputation of item

scores in test and questionnaire data, and influence on psychometric results. Multivariate

Behavioral Research, 42, 387-414.

Verhoef, P.C. (2001). Analysing customer relationships: Linking relational constructs and

marketing instruments to customer behavior. Doctoral dissertation, Erasmus University,

Rotterdam.

Westbrook, R.A., & Oliver, R.L. (1981). Developing better measures of consumer

satisfaction: Some preliminary results. In K.B. Monroe (Ed.), Advances in consumer

research (8th ed.) (pp. 94-99). MI: Association for Consumer Research.

Westbrook, R.A., & Oliver, R.L. (1991). The dimensionality of consumption emotion patterns

and consumer satisfaction. Journal of Consumer Research, 18, 84-91.

Wirtz, J., & Bateson, J.E.G. (1995). An experimental investigation of halo effects in

satisfaction measures of service attributes. International Journal of Service Industry

Management, 6, 84-102.

Wirtz, J. (2000). An examination of the presence, magnitude and impact of halo on consumer

satisfaction measures. Journal of Retailing and Consumer Services, 7, 89-99.

Wirtz, J., & Lee, M.C. (2003). An examination of the quality and context-specific

applicability of commonly used customer satisfaction measures. Journal of Service Research,

5, 345-355.

Wittgenstein, L. (1953). Philosophische Untersuchungen/Philosophical investigations. In M.

Derksen (2002). Filosofische onderzoekingen. Amsterdam: Boom.

Wittgenstein, L. (1958). The blue and brown books. In W. Oranje (1996). Het blauwe en het

bruine boek. Amsterdam: Boom.

Wolf, M.G. (1970). Need gratification theory: A theoretical reformulation of job

satisfaction/dissatisfaction and job motivation. Journal of Applied Psychology, 54, 87-

Woodall, T. (2001). Six sigma and service quality: Christian Grönroos revisited. Journal of

Marketing Management, 17, 595-607.

Yi, Y. (1990). A critical review of consumer satisfaction. In V.A. Zeithaml (Ed.), Review of

marketing (pp. 68-123). Chicago: American Marketing Association.

Zeithaml, V.A., & Bitner, M.J. (1996). Services marketing. New York: McGraw Hill.

Zeithaml, V.A., Parasuraman, A., & Berry, L.L. (1990). Delivering quality service. New York:

The Free Press.

Summary (Samenvatting)

This thesis is about measuring the satisfaction of customers in the retail banking sector of financial services. Customer satisfaction is a topic of social and economic importance, which is also reflected in the extensive academic literature on the subject. Satisfaction turns out to be difficult to define and to measure (Oliver, 1997, p. 13). This justifies further research into the meaning and the measurement of satisfaction.

Psychological properties such as satisfaction are theoretical constructs, and are inferred from the behaviour of persons. In marketing research, psychological properties are usually measured by means of questionnaires. Marketing research often uses only a single question to measure such properties, but psychometrics has shown that a single question covers the property incompletely (Messick, 1989, p. 14). Furthermore, different marketing studies use different definitions and operationalisations of particular properties. These factors hamper the interpretation and comparability of results across studies.

Chapter 1 discusses the main problems in customer satisfaction research. These are the absence of a well-elaborated definition of customer satisfaction, the poor validity of measurements of customer satisfaction, and the lack of knowledge about the influence of customer satisfaction on customer profitability. These problems are interrelated, because the absence of a well-elaborated definition of satisfaction hampers the measurement of satisfaction, and because the absence of valid measurements of satisfaction hampers the analysis of the influence of satisfaction on customer profitability. This thesis aims to contribute to the solution of these problems and, by extension, to the scientific theory of customer satisfaction and the methodology of customer satisfaction research.

The first study in this thesis concerns theoretical characteristics of psychological properties and measurement procedures for psychological properties. Psychological properties are theoretical constructs. Psychological constructs such as satisfaction have a particular linguistic and empirical meaning. The linguistic meaning of satisfaction is the use of the term satisfaction in everyday and scientific language, and can be described in a definition of satisfaction. The empirical meaning of satisfaction concerns the behaviours that are associated with satisfaction, and forms the basis for measurements of satisfaction. The measurement procedures for psychological properties are procedures for the use of psychological measurement instruments, such as psychological tests and psychological questionnaires, for the construction of scales for measuring properties, and for the scoring of persons on the scales.

Chapter 2 discusses the definition of psychological properties, the development of measurement instruments for psychological properties, the process of measuring psychological properties, the construction of scales, and the quality of measurement values. The chapter concludes with a discussion of different conceptions of construct validity. Following Messick (1989, 1995), we conceive of construct validity as the appropriateness of interpretations of scale scores in terms of the construct to be measured. This conception of construct validity was the reason for choosing the deductive design for validating the measurements of satisfaction.

The second study concerned the use of the properties satisfaction and dissatisfaction in the literature. Chapter 3 gives an overview of the main definitions and theories of these properties in the marketing literature. It was established that satisfaction and dissatisfaction are used to describe particular feelings and judgments of consumers. These feelings and judgments are a response to the customer's experiences with, for example, a product; the response relates to this product and expresses an evaluation of it. Satisfaction/dissatisfaction with a bank was defined as the customer's evaluative response that is directed at the bank and is caused by the totality of the customer's experiences with the bank. A positive response expresses satisfaction, and a negative response expresses dissatisfaction. Finally, it was established that existing customer satisfaction questionnaires are hardly suitable for measuring satisfaction with a bank.

Chapter 4 discusses the deductive design (Schouwstra, 2000) for the development of psychological questionnaires. The deductive design was used to develop a psychological questionnaire for customer satisfaction with BANK, to formulate guidelines for administering the questionnaire, to specify the measurement model for the construction of scales, and to formulate hypotheses about properties of the scale scores. The questionnaire consisted of nine closed questions about aspects of satisfaction/dissatisfaction with BANK. The monotone homogeneity model (Mokken, 1971) was used to investigate the scalability of these items. The hypotheses concerned properties of the scale scores, such as their unbiasedness and their relation with measurements of other constructs. The fit of the measurement model and the hypotheses were examined in two empirical studies.
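
Scalability under the monotone homogeneity model is commonly summarised by Loevinger's coefficient H: the sum of the observed inter-item covariances divided by the sum of the largest covariances the item marginals allow. The following is a minimal Python sketch of that ratio, not the procedure of the MSP5 software cited in the references. The function names and the small data set are hypothetical, and the sketch assumes that every item shows some variation in its scores.

```python
import numpy as np


def max_covariance(x, y):
    """Largest covariance two items can reach given their observed marginals.

    Sorting both score vectors pairs low scores with low scores and high
    scores with high scores, which maximises the covariance.
    """
    return np.cov(np.sort(x), np.sort(y), bias=True)[0, 1]


def scalability_h(scores):
    """Loevinger's H for a respondents-by-items matrix of ordered item scores."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    observed = 0.0
    maximum = 0.0
    for i in range(n_items):
        for j in range(i + 1, n_items):
            observed += np.cov(scores[:, i], scores[:, j], bias=True)[0, 1]
            maximum += max_covariance(scores[:, i], scores[:, j])
    return observed / maximum


# Hypothetical data: four respondents, three dichotomous items forming a
# perfect Guttman pattern, so the observed covariances equal their maxima.
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(scalability_h(guttman))
```

A conventional rule of thumb in Mokken scale analysis treats H of at least 0.3 as the lower bound for a usable scale; whether that threshold suits a given questionnaire is a judgment for the researcher.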

The third study was an empirical study of customer satisfaction with BANK. This was the first empirical study. Its aims were to construct a scale for customer satisfaction, to assess the appropriateness of interpreting scale scores as measurement values for customer satisfaction with BANK, and to investigate the influence of customer satisfaction on customer profitability. Chapter 5 describes the method of the study. The customer satisfaction questionnaire was administered to a sample of 3600 customers of BANK, which yielded 1689 respondents. The properties trust, quality, and loyalty were measured in the same study. The data file was enriched with data on customer profitability at the time of the study, after one year, and after two years.
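
The response rate of the first empirical study follows directly from the reported sample size (3600) and number of respondents (1689). A small Python sketch of the arithmetic, in which the function name `response_rate` is illustrative and not taken from the thesis:

```python
def response_rate(respondents, sample_size):
    """Share of sampled customers who returned the questionnaire."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return respondents / sample_size


# Figures reported for the first empirical study.
rate_study_1 = response_rate(1689, 3600)
print(f"Study 1 response rate: {rate_study_1:.1%}")  # prints 46.9%
```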

The results of the first empirical study are reported in Chapter 6. According to the monotone homogeneity model, customer satisfaction is measured on a unidimensional scale. This refuted a view from the literature holding that satisfaction and dissatisfaction represent two separate dimensions. The tests of the hypotheses about the characteristics of the scale scores supported the interpretation of the scale scores as measurement values for customer satisfaction with BANK. The test of the hypothesis about the relation between quality and customer satisfaction showed a strong relation between the absence of problems with BANK and satisfaction with BANK. This result confirms the importance of process quality for customer satisfaction. Finally, positive effects of customer satisfaction on customer profitability were found after one year and after two years. This result indicates that satisfaction influences customer profitability.

The fourth study was an empirical study of customer satisfaction with BANK. This was the second empirical study. Its aim was to determine whether the scale scores for customer satisfaction were influenced by response styles, such as a general preference for the middle response category of items or for the extreme response categories. For this study, the customer satisfaction questionnaire was administered to a sample of almost 3000 customers of BANK, which yielded 1227 respondents. To measure the response styles, data were also collected on, for example, the customer's expectations about the development of the Dutch economy. Chapter 7 describes the method of the study.
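The two response styles named above can be made concrete with a small sketch. It computes, for one respondent, the share of answers in the middle category and the share in the two extreme categories of a 5-point item. The function name, the coding of answers as 1 to 5 with None for ‘no answer’, and the example answers are assumptions for illustration; the indices actually used in the study are described in Chapter 7 and may differ.

```python
from typing import Optional, Sequence, Tuple


def response_style_shares(
    answers: Sequence[Optional[int]], n_categories: int = 5
) -> Tuple[float, float]:
    """Return (midpoint share, extreme share) for one respondent.

    answers holds item scores coded 1..n_categories; None marks 'no answer'
    and is left out of both shares.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return (0.0, 0.0)
    midpoint = (n_categories + 1) / 2  # 3 for a 5-point item
    mid_share = sum(1 for a in valid if a == midpoint) / len(valid)
    ext_share = sum(1 for a in valid if a in (1, n_categories)) / len(valid)
    return (mid_share, ext_share)


# Made-up answers to six items; the fifth item was left unanswered.
mid_share, ext_share = response_style_shares([3, 3, 1, 5, None, 2])
```

Such shares are most informative when computed over items whose content is unrelated to satisfaction, so that a preference for certain categories is not confounded with the property being measured.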

The results of the second empirical study are reported in Chapter 8. The results showed that the scale scores for customer satisfaction were somewhat biased by response styles. It therefore cannot be ruled out that response styles also slightly biased the scale scores for customer satisfaction in the first empirical study. For this reason it is recommended that, when the questionnaire is used in follow-up research, measures be taken to correct for the influence of these response styles.

Chapter 9 contains the general discussion. It was concluded that satisfaction with a bank manifests itself in emotions, regret, expectations, disconfirmation, and rational judgements. This is a useful result for scientific theory development on customer satisfaction and for customer satisfaction management in financial services. It explains, for example, why customer satisfaction is driven not only by the technical quality of what a firm delivers, but also by functional quality, that is, how a firm delivers its services, the communication with the customer, and the reputation of the firm. Furthermore, the study provides support for the theory on the influence of customer satisfaction on customer revenues. This is a useful result for scientific theory development and for strategy development in financial services. The use of modern psychometric methods contributed to the development of a measurement instrument for customer satisfaction with banks and to establishing the validity of the measurements of customer satisfaction. This is a useful result for the methodology of scientific and applied customer satisfaction research.

Appendix 1

Questionnaire study 1

Question 0. Do you consider BANK to be your main bank?
Yes
No

Question 1. Which financial products do you currently have with BANK? Multiple answers are possible.
Answer column: Yes
a Current account
b Debit card
c Credit card
d Internet banking
e Savings products
f Investment products
g Mortgage
h Credit, loans (for consumer purposes)
i Non-life insurance
j Life insurance

Question 2. Through which channel or channels have you had contact with BANK in the past year? Multiple answers are possible.
Answer column: Yes
a Branch employee
b Adviser at home
c Telephone1
d Telephone2
e Correspondence
f E-mail
g Internet
h Internet banking1
i Internet banking2
j Other, namely ….
k None

Question 3. (rotate statements) This block contains a number of statements about BANK. Please indicate for each statement to what extent you agree or disagree with it.
<5 response categories, from ‘strongly agree’ to ‘strongly disagree’, and 1 category ‘no answer’>
A I feel at home with BANK
B I am satisfied with BANK
C N/A
D There are good reasons to leave BANK
E I have mixed feelings about BANK
F N/A
G BANK meets all the requirements I have of a bank
H N/A
I N/A

Question 4. (rotate statements) Thinking back over the past year, and in particular about your experiences with BANK, please indicate for each statement to what extent you agree or disagree with it.
<5 response categories, from ‘strongly agree’ to ‘strongly disagree’, and 1 category ‘no answer’>
A I had a pleasant relationship with BANK over the past year
B BANK met my expectations
C I have regretted my choice of BANK
D I had problems with BANK over the past year

Question 5. (rotate statements) A number of statements now follow about trust in the services of BANK. Please indicate for each statement to what extent you agree or disagree with it.
<5 response categories, from ‘strongly agree’ to ‘strongly disagree’, and 1 response category ‘no answer’>
A I can count on BANK to treat me fairly
B I can count on BANK to handle my affairs correctly
C I can trust BANK to keep promises and agreements
D I sometimes doubt the capabilities of BANK
E I sometimes doubt the goodwill of BANK
F I can trust BANK
G I can count on good service from BANK

Question 6. A number of statements now follow about problems with BANK. Could you indicate whether you have had such a problem in the past year? Multiple answers are possible.
Answer column: Yes
A Errors in the handling of your banking affairs
B Errors in the processing of your instructions
C Insufficient information about your banking affairs
D Unclear information about your banking affairs
E Unreasonable charges for the use of services
F Slow service
G Slow transfers
H Poor keeping of agreements by BANK
I Insufficient accessibility by telephone
J Insufficient accessibility via the internet
K Insufficient accessibility of branches
L Poor answering of your questions
M Problems with cards
N Problems with card payments or cash withdrawals (PIN)
O Problems with internet banking
P Another problem
Q No problem

Question 7. (rotate statements) A number of aspects of the services of BANK now follow. Based on your personal experiences, could you rate the performance of BANK on these aspects?
<4 response categories from ‘excellent’ to ‘poor’, and 1 response category ‘do not know’>
A The correct processing of the instructions you give
B The speed with which transfers are made
C The speed of service by BANK
D The keeping of agreements and promises by BANK
E The correct handling of your banking affairs
F The frequency with which you receive account statements from BANK

Question 8. (rotate statements) A number of aspects of the products and services of BANK now follow. Based on your personal experiences, could you rate the performance of BANK on these aspects?
<4 response categories from ‘excellent’ to ‘poor’, and 1 response category ‘do not know’>
A The fees of the payment packages of BANK
B The convenience of the products and services of BANK
C The clarity of the information that BANK provides to you about your banking affairs
D The adequacy of the information that BANK provides to you about your banking affairs
E The charges that BANK applies for the use of services
F The interest rates on products of BANK

Question 9. (rotate statements) A number of aspects of the service via the various channels of BANK now follow. Based on your personal experiences, could you rate the performance of BANK on these aspects?
<4 response categories from ‘excellent’ to ‘poor’, and 1 response category ‘do not know’>
A The service by telephone
B The service via the internet
C The service via the branch
D The service by post/correspondence
E The ease with which you can reach BANK
F The facilities for internet banking

Question 10. (rotate statements) A number of aspects of contacts with BANK and employees of BANK now follow. Based on your personal experiences, could you rate the performance of BANK on these aspects?
<4 response categories from ‘excellent’ to ‘poor’, and 1 response category ‘do not know’>
A The friendliness of employees of BANK
B The expertise of employees of BANK
C The reliability of employees of BANK
D The extent to which BANK listens to your wishes and questions
E The way in which BANK deals with you when you make contact
F The way in which BANK handles complaints

Question 11. With which other banks do you have a relationship? Could you indicate for each bank whether you have banking affairs there? By banking affairs we mean all kinds of banking affairs, such as payments, savings, investments, loans, mortgages, insurance, internet banking, et cetera.
Answer column: Yes
a Bank1
b Bank2
c Bank3
d Bank4
e Bank5
f Bank6
g Other bank, namely……
h No other bank

Question 12. (only BANK and the banks from question 11) How important is each of the following banks to you?
<5 response categories, from ‘strongly agree’ to ‘strongly disagree’, and 1 category ‘no answer’>
A BANK
B Bank …

Question 13. (only BANK and the banks from question 11) How satisfied are you with the following banks?
<10 response categories, from ‘extremely dissatisfied’ to ‘extremely satisfied’, and 1 category ‘no answer’>
A BANK
B Bank …

Question 14. (rotate statements) A number of statements now follow about your attitude towards BANK in comparison with other banks. Could you indicate for each statement to what extent you agree or disagree with it?
<5 response categories, from ‘strongly agree’ to ‘strongly disagree’, and 1 response category ‘no answer’>
A If I need new banking products, BANK is my first choice
B I have more sympathy for BANK than for other banks
C For some things I am better off with another bank
D I am considering switching from BANK to another bank
E BANK offers me advantages that other banks do not offer
F BANK has been my main bank for many years

Question 15. N/A

Question 16. N/A

Question 17. How interested are you in banking affairs?
<5 response categories, from ‘very interested’ to ‘not interested’, and 1 category ‘no answer’>
<answer>

Question 18. How interested are you in new financial products and services that banks offer?
<5 response categories, from ‘very interested’ to ‘not interested’, and 1 category ‘no answer’>
<answer>

Question 19. N/A

Question 20. (rotate questions 20b to 20e) This block contains a number of questions about BANK. Please answer each of these questions.

Question 20a. N/A

Question 20b. How satisfied are you with BANK? Please express your judgement as a grade, where 1 means ‘extremely dissatisfied’ and 10 means ‘extremely satisfied’.
<10 response categories, from ‘extremely dissatisfied’ to ‘extremely satisfied’, and 1 category ‘no answer’>
grade

Question 20c. How well does BANK match your ideal image of a bank? Please express your judgement as a grade, where 1 means ‘far from ideal’ and 10 means ‘ideal’.
<10 response categories, from ‘far from ideal’ to ‘ideal’, and 1 category ‘no answer’>
grade

Question 20d. How well has BANK met your expectations in the past year? Please express your judgement as a grade, where 1 means ‘very poor’ and 10 means ‘excellent’.
<10 response categories, from ‘very poor’ to ‘excellent’, and 1 category ‘no answer’>
grade

Question 20e. N/A

Question 20f. N/A

Appendix 2

E-mail accompanying study 1

Dear <salutation>,

We would like to invite you to take part in a questionnaire of the BANK customer panel (BANK-Klantenpanel). This survey is about your satisfaction with BANK. You will be asked for your assessment of various aspects. You may recognise some questions that we also asked last year. We repeat these questions to gain better insight into how customers think about BANK compared with last year. By participating, you therefore help BANK to align its services better with your wishes.

Some questions in this survey closely resemble one another. We ask for your understanding. This is a deliberate choice, since this questionnaire also serves a scientific purpose. BANK wants to find out how it can best study customer satisfaction. BANK does this in collaboration with the Universiteit van Tilburg. If you answer the questions as you normally would, you also help to develop our market research.

How does this survey work? When you click the link below, you are taken directly to the questionnaire. Completing it takes about 20 minutes. You can take part in this survey up to and including 12 October. For participating in this survey you receive 10 points (value € 1). These 10 points are added to your balance within 72 hours after you complete the questionnaire. After at least two surveys you can use your points to order a small gift or donate your points to a charity: Artsen Zonder Grenzen, Natuurmonumenten or SOS Kinderdorpen. With your personal number (UserID) and unique code (password) you can log in to your personal page at www.BANK-klantenpanel.nl.

Click the link below to start the questionnaire.

Thank you very much for your cooperation with the BANK-Klantenpanel.

Kind regards,
helpdesk BANK-Klantenpanel
www.BANK-klantenpanel.nl

Appendix 3

Questionnaire study 2

Question 1. Which financial products do you currently have with BANK? Multiple answers are possible.
Answer column: Yes
a Current account
b Debit card
c Credit card
d Internet banking
e Savings products
f Investment products
g Mortgage
h Credit, loans (for consumer purposes)
i Non-life insurance
j Life insurance

Question 2. Through which channel or channels have you had contact with BANK in the past year? Multiple answers are possible.
Answer column: Yes
a Branch employee
b Adviser at home
c Telephone1
d Telephone2
e Correspondence
f E-mail
g Internet
h Internet banking
i Other ……
j None

Question 3. (rotate statements) This block contains a number of statements about BANK. Please indicate for each statement to what extent you agree or disagree with it.
<5 response categories, from ‘strongly agree’ to ‘strongly disagree’, and 1 category ‘no answer’>
A I feel at home with BANK
B I am satisfied with BANK
D There are good reasons to leave BANK
E I have mixed feelings about BANK
G BANK meets all the requirements I have of a bank

Question 4. (rotate statements) Thinking back over the past year, and in particular about your experiences with BANK, please indicate for each statement to what extent you agree or disagree with it.
<5 response categories, from ‘strongly agree’ to ‘strongly disagree’, and 1 category ‘no answer’>
A I had a pleasant relationship with BANK over the past year
B BANK met my expectations
C I have regretted my choice of BANK
D I had problems with BANK over the past year

Question 5. N/A

Question 6. (rotate statements) A number of statements now follow about your expectations regarding the development of your purchasing power. Please indicate for each statement to what extent you agree or disagree with it.
<5 response categories, from ‘strongly agree’ to ‘strongly disagree’, and 1 category ‘no answer’>
A I expect my purchasing power to improve in the coming year
B I expect my purchasing power to deteriorate in the coming year
C I expect my purchasing power to be better in 5 years than it is now
D I expect my purchasing power to be worse in 5 years than it is now

Question 7. (rotate statements) A number of statements now follow about your expectations regarding the economic development of the Netherlands. Please indicate for each statement to what extent you agree or disagree with it.
<5 response categories, from ‘strongly agree’ to ‘strongly disagree’, and 1 category ‘no answer’>
A I expect the economy of the Netherlands to improve in the coming year
B I expect the economy of the Netherlands to deteriorate in the coming year
C I expect the economy of the Netherlands to be better in 5 years than it is now
D I expect the economy of the Netherlands to be worse in 5 years than it is now

Question 8. (rotate statements) This block contains six statements about your attitude towards banking affairs, such as payments, savings, loans, mortgages, insurance, investments, et cetera. Please indicate for each statement to what extent you agree or disagree with it.
<5 response categories, from ‘strongly agree’ to ‘strongly disagree’, and 1 category ‘no answer’>
A I never worry about banking affairs
B I consider banking affairs very important
C Arranging banking affairs well makes life easier
D I find banking affairs tedious
E Banking affairs leave me cold
F Arranging banking affairs well can yield a lot of money

Question 9. (rotate statements) This block contains four statements about the transparency of the financial market. Please indicate for each statement to what extent you agree or disagree with it.
<5 response categories, from ‘strongly agree’ to ‘strongly disagree’, and 1 category ‘no answer’>
A I know the advantages and disadvantages of the banks in the Dutch market
B I find it hard to assess the quality of BANK
C I find it hard to compare the quality of different banks
D I know exactly what I can expect from BANK

Question 10. N/A
