Rater effects in creativity assessment: A mixed methods investigation

Thinking Skills and Creativity 15 (2015) 13–25

Haiying Long (Department of Leadership and Professional Studies, Florida International University, 11200 SW 8th Street, Miami, FL 33199, USA)
Weiguo Pang (School of Psychology and Cognitive Science, East China Normal University, 3663 North Zhongshan Road, Shanghai 200062, China; corresponding author)

Article history: Received 9 July 2014; received in revised form 17 October 2014; accepted 23 October 2014; available online 18 November 2014.

Keywords: Creativity assessment; Rater effects; Generalizability theory; Rater cognition; Mixed methods research

Abstract: Rater effects in assessment are defined as the idiosyncrasies that exist in raters' behaviors and cognitive processes. They comprise two aspects: raw ratings and rater cognition. This study employed mixed methods research to examine these two aspects of rater effects in creativity assessment that relies on raters' personal judgment. Quantitative data were collected from 2160 raw ratings made by 45 raters in three groups and were analyzed with generalizability theory. Qualitative data were collected from raters' explanations of their rating rationales and their answers to questions about the rating process, as well as from 12 in-depth interviews, and were analyzed with framing analysis. The results indicated that the dependability coefficients were low for all three rater groups, which was further explained by variations and inconsistencies in raters' rating procedures, use of rating scales, and beliefs about creativity.

http://dx.doi.org/10.1016/j.tsc.2014.10.004 © 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Using human judges to score individual works or behaviors is not an uncommon measurement process in the social sciences. Requiring teachers to score responses to constructed items in standardized tests is one such instance (Crisp, 2012). Other examples include counseling psychologists measuring high school students' degree of pathology and intensity of violence, graduate students in social work programs assigning scores to evaluate children's behaviors at home, and principals observing classroom teaching and evaluating teachers' performance. In creativity studies, researchers also rely heavily on raters' judgment of the products generated by participants, including the ideas produced in divergent thinking tests, creative solutions to real-world problems, and artifacts of creative writing and art (Author, 2014b).

Research on creativity raters in recent years (e.g., Kaufman & Baer, 2012; Kaufman, Gentile, & Baer, 2005; Kaufman, Baer, & Cole, 2009; Kaufman, Baer, & Gentile, 2004; Kaufman, Baer, Cole, & Sexton, 2008) has mostly focused on the influence of raters with different levels of expertise on assessment results. However, this line of research does not shed light on the issue of rater effects (Hung, Chen, & Chen, 2012). The present research aims to fill this gap by employing mixed methods to examine rater effects in assessing the creativity of responses to two science tasks. Examining this issue is crucial because raters and their judgments are an indispensable part of the assessment.



In addition, examining rater effects reveals the behaviors and cognitive processes of raters during the assessment, which would further facilitate possible training for raters in the future and hence help improve the assessment procedure.

2. Literature review

2.1. Rater effects

When human judgment is involved in measurement, the measurement process becomes more complex than it appears. After the products are made, they are presented to the raters, who assign ratings based on the criteria provided. The final decision made about individuals' traits, then, is determined not only by how individuals perform on the tasks but also by how raters perform in the assessment process. Traditionally, consistency or agreement among raters has been treated as the sole demonstration of rater performance, and most rater training focuses on how to achieve high rater agreement. However, since the 1970s, researchers have come to realize that no matter how much training and monitoring raters go through before and during the assessment, their performance is still greatly affected by the idiosyncrasies that exist in their behaviors and cognitive processes (Charney, 1984; Hamp-Lyons, 2007; Noyes, 1963). These idiosyncrasies are defined as rater effects (Wolfe, 2004).

According to Wolfe and McVay (2012), there are two major aspects of rater effects. One is the manifest level of the effects, which is reflected in the raw ratings assigned by raters. The other is the underlying level, which is shown in raters' thinking processes, or rater cognition. These two aspects are closely associated with measurement reliability and validity. On the one hand, the raw ratings are a potential source of measurement error in estimating reliability among raters (Campbell & Fiske, 1959; Cronbach, Rajaratnam, & Gleser, 1963; Shavelson & Webb, 1991). As Guilford (1936) stated, ". . . raters are . . . subject to all the errors to which humankind must plead guilty" (p. 272). On the other hand, raters' idiosyncrasies interfere with the construct being measured (Cumming, Kantor, & Powers, 2001) and thus become construct-irrelevant variance, one of the major threats to construct validity (Messick, 1995). Rater cognition is also part of the substantive aspect of validity, which focuses on how judges evaluate works and whether judges' processes are consistent with their interpretation of the construct (Messick, 1995). Its significance in the validation process is highlighted in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999):

If the rationale for a test use or score interpretation depends on premises about the psychological processes or cognitive operations used by examinees, then theoretical or empirical evidence in support of those premises should be provided. When statements about the processes employed by observers or scorers are part of the argument for validity, similar information should be provided. (p. 19)

In addition, under Kane's (2006) framework of argument-based validation, rater cognition-related factors, such as whether raters follow rating criteria and whether they use the categories in the intended manner, help establish an interpretive argument.

Furthermore, the two aspects of rater effects are closely related to each other, in that the understanding of raw ratings, or even of the consistency among raters, "depends on an intuitive, if not explicit, understanding of rater cognition" (Bejar, 2012, p. 3). However, as two aspects of rater effects, raw ratings and rater cognition are examined with different research methodologies. Ratings are typically analyzed with quantitative methodologies such as generalizability theory and latent trait measurement models, whereas rater cognition is investigated with qualitative methodologies such as think-aloud and verbal protocol analysis (Wolfe & McVay, 2012).

2.2. Use of generalizability theory in analysis of raw scores

Under the framework of classical test theory (CTT), raters' evaluation decisions are often expressed as raw scores, and the consistency among raters is estimated by interrater reliability. In general, there are three categories of interrater reliability: consensus, consistency, and measurement estimates. When two raters do not share common meanings of the rating scales but are each able to be "consistent in classifying the phenomenon according to his or her own definition of the scale" (Stemler, 2004, Consistency Estimates section, para. 1), a situation that resembles creativity assessment, particularly the Consensual Assessment Technique, it is best to use a consistency estimate such as Cronbach's alpha.

However, according to Stemler (2004), Cronbach's alpha has a few weaknesses. For example, raters may have different interpretations of rating scores and rating categories, and alpha is highly sensitive to the distribution of the data. In addition, even a high alpha does not necessarily indicate high consensus among raters, because a high alpha may result from a large number of raters. Moreover, because CTT attributes variation in observed scores only to a true score and a random error, a raw score under this framework cannot reflect variations among raters, such as rater severity, interactions between raters and other aspects of the evaluation, and other random errors. For these reasons, Cronbach (2004, p. 394) himself claimed, "Coefficients are a crude device that does not bring to the surface many subtleties implied by variance components," and he and his colleagues further developed generalizability (G) theory (Cronbach et al., 1963; Shavelson & Webb, 1991).
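As a concrete illustration of the consistency estimate discussed above, the following sketch computes Cronbach's alpha over a subjects-by-raters matrix, treating raters as items. It is a minimal, generic example rather than the analysis used in this study; the function name cronbach_alpha and the small rating matrix are illustrative assumptions.

import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Consistency estimate across raters.

    ratings: 2-D array, rows = subjects (products), columns = raters.
    Raters are treated as 'items', so alpha reflects whether raters
    rank-order the products consistently, not whether they agree on
    absolute score levels.
    """
    n_raters = ratings.shape[1]
    rater_variances = ratings.var(axis=0, ddof=1)      # variance of each rater's scores
    total_variance = ratings.sum(axis=1).var(ddof=1)   # variance of subjects' summed scores
    return (n_raters / (n_raters - 1)) * (1 - rater_variances.sum() / total_variance)

# Hypothetical example: 6 products rated by 3 raters on a 1-5 scale.
ratings = np.array([
    [2, 3, 2],
    [4, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
    [5, 5, 4],
    [3, 4, 3],
])
print(round(cronbach_alpha(ratings), 2))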

In G theory, an observed score is assumed to be a sample drawn from a universe of possible observations, and each aspect of the measurement is defined as a facet. Each facet involved, and the interactions among facets, are sources of variation in the observed scores, and the theory aims to disentangle more accurately the contribution of each source of error to the total variation (Shavelson & Webb, 1991). As a key component of the measurement process, the rater is regarded as one facet in G theory and is treated as a sample of all the possible individuals who could assess the artifacts. The sampling variability due to raters is estimated along with the other sources of error in the theory.
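To make the variance decomposition concrete, the sketch below estimates the variance components of a fully crossed, one-facet subjects-by-raters design from the usual expected mean squares (Shavelson & Webb, 1991). It is a simplified, generic stand-in for dedicated software such as GENOVA, not the computation reported later in this article, and the example matrix is hypothetical; negative estimates are truncated at zero, a common convention when sampling error pushes a component below zero.

import numpy as np

def one_facet_g_study(x: np.ndarray) -> dict:
    """Estimate variance components for a crossed s x r (subject-by-rater) design.

    x: 2-D array of ratings, rows = subjects, columns = raters.
    Returns estimated variance components for subjects, raters, and the
    residual (subject-by-rater interaction confounded with random error).
    """
    n_s, n_r = x.shape
    grand = x.mean()
    subj_means = x.mean(axis=1)
    rater_means = x.mean(axis=0)

    ms_s = n_r * ((subj_means - grand) ** 2).sum() / (n_s - 1)
    ms_r = n_s * ((rater_means - grand) ** 2).sum() / (n_r - 1)
    resid = x - subj_means[:, None] - rater_means[None, :] + grand
    ms_sr = (resid ** 2).sum() / ((n_s - 1) * (n_r - 1))

    var_sr_e = ms_sr                          # sigma^2(sr,e): interaction + error
    var_s = max((ms_s - ms_sr) / n_r, 0.0)    # sigma^2(s): subjects
    var_r = max((ms_r - ms_sr) / n_s, 0.0)    # sigma^2(r): raters
    return {"subjects": var_s, "raters": var_r, "error": var_sr_e}

# Hypothetical example: 4 subjects rated by 3 raters.
x = np.array([[3, 4, 3],
              [2, 2, 1],
              [5, 4, 4],
              [3, 3, 2]], dtype=float)
print(one_facet_g_study(x))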

A number of studies have used G theory to identify rater effects (e.g., Hill, Charalambous, & Kraft, 2012; Lynch & McNamara, 1998). In the field of creativity, Silvia et al. (2008) also employed this theory to examine the reliability of two subjective rating methods, "Top Two Creativity" and "Average Creativity". Recent attention has focused on using many-facet Rasch measurement to analyze and correct for rater effects (Linacre, 1996; Iramaneerat, Yudkowsky, Myford, & Downing, 2008; Lynch & McNamara, 1998; Sudweeks, Reeve, & Bradshaw, 2005; Smith & Kulikowich, 2004; Wolfe, 2004; Wolfe & McVay, 2012). Hung et al.'s (2012) study examined rater effects in creativity performance assessment with a many-facet Rasch model and found a rater-by-criterion interaction, suggesting that some raters were more severe in rating specific criteria.

Although studies using the many-facet Rasch model have revealed important information about rater effects, Lynch and McNamara (1998) argued that G theory as analyzed by the GENOVA software and many-facet Rasch measurement as analyzed by FACETS have different emphases. Using the microscope as an analogy, they stated,

FACETS turns the magnification up quite high and reveals every potential blemish on the measurement surface. GENOVA, on the other hand, sets the magnification lower and tends to show us only the net effect of the blemishes at the aggregated level. This is not to say that 'turning up the magnification' is the same as increasing the accuracy. It merely suggests that there is a different level of focus (individuals vs. groups). (p. 176)

2.3. Rater cognition

Most findings with regard to rater cognition are drawn from research on human judgment, writing assessment, and formative assessment. Sadler (1989) distinguished two approaches that are employed to assign scores to the attributes of end products when human judgments are made. One is the analytic approach, which relies on a number of predetermined criteria or categories; the final score of the product is generated from a formula that includes the weights of the different criteria. The other is the configurational (or holistic, global) approach, which assesses the work as a whole and does not require the criteria to be specified prior to rating. These two approaches have been widely applied in assessing essays and other writing products, and rater cognition has been found with both approaches (e.g., Cohen & Manion, 1994; Noyes, 1963; Vaughan, 1991).

Along another line of research on rater cognition, researchers have identified the process of scoring in various assessment contexts (Bejar, 2012; Crisp, 2012; Lumley, 2002; Suto & Greatorex, 2008). Generally, the process includes reading, rating, and reevaluating. However, rater cognition is found in every step of the process. For instance, scoring starts with reading the scoring instructions, the criteria, or the work, and it is conceived as an interpretive process. During this process, raters construct their own meaning of the work, or reconstruct the students' meanings implied in the work, based on the raters' existing knowledge and experience (Crisp, 2012; Cumming, 1990; Huot, 1993; Johnson-Laird, 1983). As a result, raters form a mental representation of the works they examine as well as a mental rubric, whether or not criteria or rubrics are provided for them (Bejar, 2012; Crisp, 2008a, 2012). At the time of assigning a score, raters often compare the mental rubric with the representation of the works. They also find points of reference or norms in their comparison, such as the same kind of products that they have seen before or a model work (Laming, 2004). The major challenge that raters face at this point is the possible conflict among their overall impression of the writing product, the specific characteristics of the product, and the categories of the rating scale. In order to solve this conflict, raters reconcile these aspects. Sometimes, they rely heavily on their intuitive impression of the product rather than on the rating categories. These findings seem to suggest dual processing, which includes two different, but simultaneously active, cognitive systems. One is the system 1 thought process, which is automatic and associative; it is also labeled intuitive and is composed of processes that act quickly but are difficult to explain explicitly. The other is the system 2 thought process, which is slow and deliberate; it is also labeled reflective and is composed of controlled and effortful application of rules (Kahneman & Frederick, 2002; Stanovich & West, 2002; Suto & Greatorex, 2008).

Another growing area in research on rater cognition is the investigation of the effect of raters' background on their assessment performance. Cumming (1990) compared the performance of six expert and seven novice ESL teachers in assessing 12 writing compositions. He found that the expert and novice groups were significantly different from each other, especially in the ratings of content and rhetorical organization. What's more, compared to novice raters, expert raters showed more self-reflection, paid more attention to key criteria in the text, and made more summaries of their own judgment. These findings in writing also resonated with a few studies on creativity assessment that demonstrated low correlations between expert and novice raters' assessments (e.g., Hickey, 2001; Kaufman et al., 2008; Runco, McCarthy, & Svenson, 1994).

3. The present study

Based on the literature review on generalizability theory and rater cognition, three points are worth noting. First, Hung et al.'s (2012) study confirmed the existence of rater effects in assessing creative products, but it did not address such questions as where the rater effects come from and why they exist. That study also indicated that rater effects might be related to raters' different interpretations of the rating scale and to the degree of severity in criterion application. However, these hypotheses have seldom been examined in the literature. In addition, raters in that study received thorough training prior to their assessment of the works, which greatly reduced rater effects; little is known about rater effects when no training is provided before the assessment. Second, Long's (2014) study examined the rating criteria used in assessing the creativity of science tasks from the raters' perspective, and it provides some insights about rater cognition; however, that study did not specifically focus on rater cognition, and other aspects of rater cognition, such as the rating procedure and the cognitive process, were not analyzed. Third, in creativity assessments that involve human raters, both analytic and holistic approaches have been used in scoring. Comparatively speaking, the holistic approach, in which creativity is assessed relying only on raters' own criteria or definition of creativity, is more popular (Long, in press). Yet little is known about rater effects in this context, or even in creativity assessment as a whole.

This study focuses on rater effects in assessing the creativity of works generated from two science tasks. It aims to answer the following research question: how are rater effects displayed in raw ratings and rater cognition? Two sub-questions about rater cognition are further examined: What rating procedures are raters engaged in? What variations during the assessment lead to rater effects? Both quantitative and qualitative approaches are employed in this study, with generalizability and dependability studies analyzing the raw ratings and a think-aloud method investigating rater cognition.

4. Methodology

4.1. Rationales for using mixed methods

First, the two components of rater effects, raw ratings and rater cognition, have seldom been examined in a single study, even in the literature on writing assessment. Second, mixed methods research centers on "breadth and depth of understanding and corroboration" and can provide a fuller picture of the phenomenon of interest (Johnson, Onwuegbuzie, & Turner, 2007, p. 123). In addition, a few researchers in writing and constructed-response assessment have noted the strength of employing both quantitative and qualitative approaches in the same study (Myford, 2012; Weigle, 1999). As Weigle (1999) noted when she employed both approaches to investigate rater and prompt interaction in writing assessment,

Finally, perhaps the most important implication of the study is an illustration of how quantitative and qualitative analyses can complement each other. The quantitative results demonstrated that the groups of raters were rating the two essay types differently, but the qualitative results were equally important in providing insights into the causes of these differences. While applying both methodologies can be time consuming, the payoff in terms of our understanding of the process of writing assessment, and ultimately in improving the validity of our assessments, may be worth the effort. (p. 171)

Third, a recent review (Long, in press) of the research methodologies used in five key creativity journals between 2003 and 2012 showed that less than 5% of creativity studies employed mixed methods, indicating a great need for more mixed methods research in the field.

4.2. Participants and procedure

The original data were collected as part of a project on assessing creative products in science. Two open-ended science questions were created for the purpose of this study. One question was about water and water evaporation (i.e., the water task) and the other was about the earth's tilt and climate change (i.e., the earth task). Students had learned the content of these two topics before they were asked to work on the two tasks. Responses to the questions were collected from 48 sixth-grade students. Three groups of judges – educational researchers, elementary school teachers, and education-major undergraduate students – were selected to represent judges with different levels of expertise. Educational researchers are typically experts in education and creativity but might not know as much about the sixth-grade science curriculum and the characteristics of elementary school students as elementary school teachers do. Elementary school teachers might know less about creativity and educational research but are more experienced with students and domain knowledge, and they are considered quasi-experts. Comparatively speaking, education-major undergraduate students are novices in both educational research and elementary school settings. Each judge group consisted of 15 judges, and a total of 45 judges were recruited from local schools and a local university.

Quantitative data and part of the qualitative data were collected at the same time, while the other qualitative data were collected at a later time; therefore, this study was a combination of convergent and sequential mixed methods research (Creswell, 2014). Judges assigned scores to all the works on a 1–5 scale, with 1 being the least creative and 5 being the most creative. They were also instructed to write down their rationales for assigning a score to a response, or their comments, if any. This process is similar to the verbal protocol or "think-aloud" method that has been shown to be valid in assessing individuals' cognitive processes (Crisp, 2008b). At the end of the rating, judges were asked to answer a few questions to explain the meaning of creativity. Based on the answers provided by the judges, follow-up questions were asked if ambiguity or confusion arose. Semi-structured interviews were further conducted with 12 judges. All the ratings were analyzed with generalizability and dependability studies, and all the qualitative data were analyzed with framing analysis. More details about the participant sample, the tasks, and the procedures of data collection and analysis are available in Long's (2014) study.
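For readers who want to picture how the 2160 ratings could be organized before analysis, here is a hypothetical long-format layout; the class name Rating and its field names are illustrative assumptions, not the study's actual data files.

from dataclasses import dataclass

@dataclass
class Rating:
    """One row of the rating data in long format (illustrative layout only)."""
    rater_group: str   # "researcher", "teacher", or "undergraduate"
    rater_id: int      # 1-15 within each group
    task: str          # "water" or "earth"
    response_id: int   # which student response was rated
    score: int         # 1-5 creativity rating

# Example records; the full data set would contain 2160 such rows.
ratings = [
    Rating("researcher", 1, "water", 7, 4),
    Rating("teacher", 5, "earth", 12, 2),
]

# A subjects-by-raters matrix for one group and one task can then be built
# by pivoting rows that share the same rater_group and task.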

Table 1. Descriptive statistics of individual ratings.

Rater | Researcher group M (SD) | Teacher group M (SD) | Under. group M (SD)
1 | 3.38 (1.18) | 3.40 (1.32) | 3.10 (1.36)
2 | 2.15 (1.17) | 2.54 (1.25) | 3.65 (1.18)
3 | 2.60 (1.05) | 2.40 (1.11) | 3.31 (1.11)
4 | 3.52 (1.20) | 3.52 (1.15) | 3.19 (1.32)
5 | 3.19 (.82) | 2.08 (.98) | 2.54 (1.17)
6 | 2.56 (1.01) | 2.73 (1.11) | 3.23 (1.04)
7 | 3.00 (1.32) | 3.31 (1.11) | 2.83 (1.17)
8 | 2.73 (1.01) | 2.67 (1.28) | 3.23 (1.39)
9 | 2.75 (1.21) | 3.31 (1.52) | 2.60 (1.12)
10 | 3.77 (.99) | 3.08 (1.09) | 3.04 (1.35)
11 | 3.17 (1.19) | 3.17 (1.06) | 3.23 (1.49)
12 | 3.15 (1.29) | 3.44 (1.18) | 2.75 (1.02)
13 | 2.23 (1.08) | 3.25 (1.18) | 3.25 (1.41)
14 | 3.35 (.93) | 3.19 (1.18) | 2.73 (1.16)
15 | 2.58 (1.33) | 3.46 (1.46) | 2.98 (1.08)


Note: This table includes the mean and standard deviation of the ratings provided by the raters in the three groups. The first number in each column is the mean and the number in parentheses is the standard deviation. Under. group = undergraduate student group.

5. Results

5.1. Analysis of raw ratings

5.1.1. Generalizability (G) study

Three groups of judges, 45 judges in total, assessed 48 students' responses to two tasks on a 1–5 scale and generated a total of 2,160 ratings in this study (see Table 1 for the mean and standard deviation of the ratings provided by each rater). In this generalizability (G) study, the subject (or student) is the object of measurement, and rater and task are two facets, or sources of variability or error. Because the two tasks were created specifically for this study, they were viewed as fixed facets and were analyzed separately. In theory, the raters in each group are viewed as randomly selected from an infinite number of raters and are treated as a random facet. Because there were three rater groups, a G study was conducted separately for each group, and the results were analyzed by task and by rater group. For this one-facet G study, the sources of variance are subject, rater, and error, which includes the subject-rater interaction and other random errors. The findings (see Table 2) consistently demonstrated that the biggest portion of the total variance in all of the situations was associated with the error (i.e., 47–69%).

Table 2. Variance components and percentage of the total variance (estimated variance component, with its percentage of the total variance in parentheses).

Researcher group, Water task: Subjects (s) .297 (19%); Raters (r) .180 (12%); Error (s × r, e) 1.072 (69%); Total 1.549 (100%)
Researcher group, Earth task: Subjects (s) .493 (35%); Raters (r) .263 (18%); Error (s × r, e) .672 (47%); Total 1.428 (100%)
Teacher group, Water task: Subjects (s) .315 (18%); Raters (r) .231 (13%); Error (s × r, e) 1.196 (69%); Total 1.742 (100%)
Teacher group, Earth task: Subjects (s) .582 (35%); Raters (r) .210 (13%); Error (s × r, e) .877 (53%); Total 1.669 (100%)
Undergraduate group, Water task: Subjects (s) .463 (28%); Raters (r) .061 (4%); Error (s × r, e) 1.118 (68%); Total 1.642 (100%)
Undergraduate group, Earth task: Subjects (s) .493 (32%); Raters (r) .125 (8%); Error (s × r, e) .918 (60%); Total 1.536 (100%)


Table 3. Estimated G and Φ coefficients for each task.

No. of raters | Water task: Researcher (G/Φ) | Teacher (G/Φ) | Under. (G/Φ) | Earth task: Researcher (G/Φ) | Teacher (G/Φ) | Under. (G/Φ)
1 | .22/.19 | .21/.18 | .29/.28 | .42/.35 | .40/.35 | .35/.32
2 | .36/.32 | .35/.31 | .45/.44 | .59/.51 | .57/.52 | .52/.49
3 | .45/.42 | .44/.40 | .55/.54 | .69/.61 | .67/.62 | .62/.59
4 | .53/.49 | .51/.47 | .62/.61 | .75/.68 | .73/.68 | .68/.65
5 | .58/.54 | .57/.53 | .67/.66 | .79/.73 | .77/.73 | .73/.70
6 | .62/.59 | .61/.57 | .71/.70 | .81/.76 | .80/.76 | .76/.74
7 | .66/.62 | .65/.61 | .74/.73 | .84/.79 | .82/.79 | .79/.77
8 | .69/.65 | .68/.64 | .77/.76 | .85/.81 | .84/.81 | .81/.79
9 | .71/.68 | .70/.67 | .79/.78 | .87/.83 | .86/.83 | .83/.81
10 | .73/.70 | .72/.69 | .81/.80 | .88/.84 | .87/.84 | .84/.83
11 | .75/.72 | .74/.71 | .82/.81 | .89/.85 | .88/.85 | .86/.84
12 | .77/.74 | .76/.73 | .83/.82 | .90/.86 | .89/.87 | .87/.85
13 | .78/.76 | .77/.74 | .84/.84 | .91/.87 | .90/.87 | .87/.86
14 | .80/.77 | .79/.76 | .85/.85 | .91/.88 | .90/.88 | .88/.87
15 | .81/.78 | .80/.77 | .86/.85 | .92/.89 | .91/.89 | .89/.88

Note: All numbers are rounded to two decimals. Under. = undergraduate student group.

The variance explained by subjects, or students' "true" creativity, was small (i.e., 18–35%) compared to the error, whereas the variance related to raters accounted for the smallest part of the total variance (from 4% to 18%). These results suggest that raters showed some degree of consistency in their assessment, but the large error variance, which includes the interaction between raters and students, also indicated the existence of large inconsistencies among raters.

Differences were also detected in the variance components between the ratings of the two tasks. Overall, the error variance for the water task (i.e., 68–69%) represented a larger percentage of the total variance than for the earth task (i.e., 47–60%), whereas the variance related to subjects (i.e., 18–28%) and to raters (i.e., 4–13%) in the water task was slightly smaller than in the earth task (i.e., 32–35% for subjects and 8–18% for raters). These results indicated that students performed better on the earth task than on the water task and that raters provided more consistent ratings for the earth task than for the water task.

Interestingly, raters in the three groups performed similarly to some degree in both tasks. In the water task, the variances associated with the error for the three groups' ratings were very close (69% for the researcher and the teacher groups, 68% for the undergraduate group). The variances of subjects and raters in the researcher and the teacher groups were similar as well (19% and 18% for subjects and 12% and 13% for raters). But unlike the other two groups, the variances of subjects and raters in the undergraduate group were 28% and 4%, respectively. This indicated that raters in this group were somewhat more consistent in their assessment than those in the other two groups. In the earth task, the variances of subjects for the three rater groups were close (32% and 35%) and were equal for the researcher and the teacher groups (35%). But the variances attributable to error and to raters for the earth task showed some variation across groups. The undergraduate group had a smaller rater variance (8% vs. 18% in the researcher group and 13% in the teacher group) but a somewhat larger error variance than the other two groups (60% vs. 47% in the researcher group and 53% in the teacher group). Given that larger error variance, this does not mean that judges in the undergraduate group were more consistent than those in the other two groups in rating the earth task. Based on these patterns, comparatively speaking, judges in the researcher and the teacher groups were more alike for both tasks.

5.1.2. Dependability study

In order to provide a comprehensive picture of the dependability, or generalizability, of the ratings with different numbers of raters, coefficients for relative (G coefficient) and absolute (Φ coefficient) decisions were estimated for 1–15 raters (see Table 3). Overall, the relative and absolute coefficients were very close for the two tasks and the three rater groups. Comparatively speaking, the water task had lower coefficients than the earth task, indicating that raters were not as consistent in assessing the water task as the earth task. If coefficients of .90 or larger are viewed as excellent, .80 and greater as good, and .70 and greater as acceptable (Nunnally & Bernstein, 1994; Stemler, 2004), the G and Φ coefficients in all three rater groups for the water task (i.e., .18–.67) are poor when there are five or fewer raters. With 10 raters, the coefficients of the researcher and the teacher groups for the water task are close to the acceptable level and the coefficient of the undergraduate group reaches the good level. With 15 judges, the coefficients in the three groups for the water task do not increase appreciably.

Compared with the water task, the coefficients for the earth task improve to some degree but not substantially. With 5 raters, the coefficients for the three rater groups in the earth task are acceptable. With 10 raters, the coefficients in the three groups increase to the good level. With 15 judges, the coefficients are close to the excellent level in all of the rater groups. These data suggest that relying only on raters' personal judgment to assess creativity did not yield consistent or generalizable results, especially with a small number of raters.
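The dependability results in Table 3 can be approximated directly from the variance components in Table 2 with the standard decision-study formulas, in which the relative (G) coefficient treats only the subject-by-rater interaction/error as error and the absolute (Φ) coefficient also counts the rater main effect. The sketch below is a generic illustration (not the GENOVA output itself) using the researcher-group components from Table 2; for one rater it returns roughly .22/.19 for the water task and .42/.35 for the earth task, in line with Table 3.

def g_and_phi(var_s: float, var_r: float, var_sr_e: float, n_raters: int) -> tuple:
    """Decision-study coefficients for a one-facet s x r design with n_raters raters.

    G (relative decisions) treats only the s x r interaction/error as error;
    Phi (absolute decisions) also counts the rater main effect as error.
    """
    g = var_s / (var_s + var_sr_e / n_raters)
    phi = var_s / (var_s + (var_r + var_sr_e) / n_raters)
    return g, phi

# Variance components from Table 2, researcher group.
water = dict(var_s=0.297, var_r=0.180, var_sr_e=1.072)
earth = dict(var_s=0.493, var_r=0.263, var_sr_e=0.672)

for n in (1, 5, 10, 15):
    g_w, p_w = g_and_phi(n_raters=n, **water)
    g_e, p_e = g_and_phi(n_raters=n, **earth)
    # e.g., n = 1 gives roughly .22/.19 (water) and .42/.35 (earth), matching Table 3.
    print(f"{n:>2} raters  water G/Phi = {g_w:.2f}/{p_w:.2f}  earth G/Phi = {g_e:.2f}/{p_e:.2f}")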

5.2. Rater cognition

5.2.1. Rating procedure

Based on the qualitative data, the procedure of creativity assessment was generated from the analysis. Generally, most raters were engaged in three steps.

5.2.1.1. Preparing. Raters did not hasten to begin the assessment process. At the beginning, they read the instructions and the two science questions that the participants had responded to. Several things occurred to raters at this preparation stage. First, because no criteria were provided for them, they came up with their initial concepts of creativity. Most concepts at this stage were intuitive but served as the foundation for the rating criteria in their later assessment. Second, because they knew the responses were collected from sixth graders, they made assumptions about students' cognitive characteristics and the domain knowledge the students should have acquired regarding the two tasks. Third, after they read the two science questions, they automatically put themselves in the scenarios described in the tasks and even generated their own answers.

5.2.1.2. Scoring. The scoring process started with reading and understanding students' responses. Raters read all or some of the responses first to obtain an overview of the range of answers. During reading, they also summarized the ideas and sometimes even reconstructed the logic of the responses. After reading, raters applied their own set of criteria to the responses. Compared to the initial concepts of creativity that raters generated at the preparation stage, these criteria were more concrete and carried more substantive meaning. For example, some raters viewed novelty as an important component of creativity in their initial concept, but they were more specific about the meaning of novelty and the weight placed on novelty when applying their criteria. Raters who generated their own answers to the two science questions at the preparation stage also compared students' responses to those self-generated answers. Some raters checked the instructions and science questions several times to ensure the accuracy of their judgment. They further considered the use of the 1–5 scale and distinguished the differences among the scale points.

5.2.1.3. Adjusting. Although raters were not required to compare ratings within the sample, at the end of their assessment they made adjustments to the original ratings they had assigned. Some adjustments were based on the comparison of one rating with all the other ratings. For example, raters who included novelty as one criterion increased the original score of a response if they found that the idea in the response had not been mentioned by any others in the sample. Some adjustments were based on the reorganization of the criteria and the weights placed on each criterion. Other adjustments included the modification of the 1–5 scale to a 2–5 or 1–4 scale.

5.2.2. Rater cognition

The fact that most raters followed a general assessment procedure does not necessarily mean that there were no individual differences. Rater cognition is manifested in the assessment procedure, the use of rating scales, and raters' beliefs about creativity.

5.2.2.1. Assessment procedure. During their assessment, some raters first obtained an overview, set up some benchmarks, and then assigned individual scores. Some just assigned a score to a response and did not change it later. Sometimes, raters returned to a response and changed the score. Other times, they combined some of these steps and thus formed their own processes.

When assessing students' works, a couple of raters obtained an overview of all or some of the answers before they assigned a score. For these raters, the overview provided a general idea of the range of the answers or of what all or some of the answers looked like. Based on this overview, some raters identified the most common responses in the sample and considered them as sample answers to help gauge their rating criteria. Some raters did not use the most common responses but instead considered the extreme responses, the ones that were given 1s or 5s, as their benchmarks. In some cases, the responses to the water task served as the benchmarks for judging the answers to the earth task.

A few raters generated their own answers or solutions to the two tasks and then compared them to students' responses. There were two consequences of this comparison, depending on which criterion was a rater's primary standard (for the criteria raters used in their assessment, see Long, 2014). If raters considered novelty the most important factor and believed that only ideas they had never thought of could be called novel, the responses that matched their own answers received a low score, because those were ideas the raters had already thought of and so did not qualify as novel; they assigned a high score to ideas that did not match their own answers, because those were ideas they had not thought of before and so qualified as novel. On the other hand, if raters considered appropriateness their primary criterion, the responses that matched their answers were given a high score, and those that did not match their answers received a low score.

Several raters assigned a score to a response without having an overview or comparing it to their own answers. Once a score was given, some raters did not change it until the end of the evaluation. As rater 3 in the researcher group explained, "I understood that each student responded independently, so I consciously kept in mind that it would not be fair to go back and lower the creativity ratings."


Table 4. Rater use of rating procedures.

Rater group | Assign score + change | Assign score + no change | Overview + change | Others
Researcher | 2 | 5 | 3 | 5
Teacher | 7 | 2 | 4 | 2
Undergraduate | 7 | 4 | 1 | 3
Total | 16 | 11 | 8 | 10

Note: "Assign score + change" indicates that the rater assigned a score to a response first and changed it later; "Assign score + no change" indicates that the rater assigned a score first and did not change it later; "Overview + change" indicates that the rater gained an overview of the responses first and changed ratings later. The number in each cell is the number of raters in each group who employed the procedure in that column.

Table 5. Rater use of rating scales.

Group | Water task, 1–4 scale | Water task, 2–5 scale | Earth task, 1–4 scale | Earth task, 2–5 scale | Earth task, 2–4 scale
Researcher | Rater 3 | Raters 5, 10, 14 | — | Raters 10, 14 | Rater 5
Teacher | Raters 5, 6 | Raters 11, 12 | Raters 3, 5, 11 | Raters 4, 12 | —
Undergraduate | — | Rater 2 | Rater 14 | Raters 2, 15 | —

Note: Each cell shows which raters in the group used that scale; — indicates that no rater in the group used it.

However, a couple of raters changed their original ratings after they read more responses. For example, raters changed the rating of a response that included content similar to responses they read later. If they compared the response to others in the sample and thought that a solution provided by many people should be given a lower rating, they lowered the original rating. But if, at the end of their assessment, they found that the ideas in a solution had not been mentioned by any others in the sample, they might go back and increase the original score.

Some raters' rating procedures were more complex than those described above because they combined different steps. For example, they obtained an overview of all or some of the responses, assigned a score to each response, and changed the original ratings at some point. Or they generated their own answers, offered a score, and did not change it later. Furthermore, although most raters used one of three procedures (see Table 4) – assigning a score and changing it later (n = 16), assigning a score without later change (n = 11), and gaining an overview and changing ratings later (n = 8) – the three rater groups also showed differences in their general rating processes. For instance, there were seven patterns of processes in the researcher group, six in the teacher group, and four in the undergraduate group. More interestingly, five judges in the researcher group were engaged in the same pattern, assigning scores to the responses without changing them later, and three gained an overview before assigning ratings and changed a few later. Seven judges in the teacher group shared the same pattern of assigning scores and changing some later, and four gained an overview before giving scores and changed some later. Seven raters in the undergraduate group employed the pattern of assigning scores and changing a few later, whereas four assigned scores and did not change them later.

5.2.2.2. Rating scales. Raters also varied in their interpretation and use of the rating scales. Most raters in the three groups used the complete rating scale (i.e., the 1–5 scale), but a couple of raters modified their rating scales from 1–5 to 2–5 or 1–4. One rater even used a 2–4 scale when he assessed students' responses to the earth task. In general, the mean ratings of the judges who used the 2–5 scale were at the high end of the scores, while the mean ratings of those who used the 1–4 scale were at the low end. In addition, when judges were assessing the two tasks, they were more likely to use modified scales for the earth task (i.e., 11 judges in total) than for the water task (i.e., 9 judges in total) (see Table 5). Comparatively speaking, more judges used the 2–5 scale (i.e., 12 judges in total) than the 1–4 scale (i.e., 7 judges in total) across the two tasks. This suggests that judges who modified rating scales tended to offer higher rather than lower scores.

Differences in the use of rating scales also existed among the three rater groups across the two tasks. Four raters in the researcher group modified scales to 2–5, 1–4, or 2–4 in one or both of the two tasks. Six raters in the teacher group used scales of 1–4 and 2–5 in the two tasks. Three raters in the undergraduate group used modified scales in the two tasks. This suggests that more raters in the undergraduate group applied the complete rating scale than judges in the other two groups. In other words, undergraduate student raters may have been slightly more consistent than the other two groups in applying their rating scales. On the other hand, not all the raters used the same modified scale to assess responses to the two tasks. For example, four raters in the three groups used modified scales only in the earth task, while two raters used them only in the water task. Two raters in the researcher group and one rater in the undergraduate group used the 2–5 scale in both tasks. Rater 12 in the teacher group kept using the same 2–5 scale, and rater 5 in the same group used the 1–4 scale in the two tasks. Interestingly, one rater in the teacher group used the 2–5 scale in the water task but the 1–4 scale in the earth task.

Furthermore, raters changed their scales not arbitrarily but for a good reason. In general, when raters did not assign anybody a 5, it was usually because they thought students' responses were far from their criteria for the highest score, or because they adopted stricter criteria. Rater 5 in the teacher group was such an instance. She gave no response a 5 because she thought none of the students' responses deserved a score of 5. For her, a highly creative response should be both very novel and appropriate, but no responses had that high a degree of both qualities. When raters did not assign any 1s, it was partly related to their beliefs. Sometimes, it was a belief about creativity, as with rater 2 in the undergraduate group, who thought, "In my opinion, everybody possesses creativity and nobody's response is not creative." Sometimes, it was a belief about the respondents, as with rater 10 in the researcher group, who explained, "It's not easy for 6th graders to have those crazy ideas. I rarely gave ratings of 2 unless I think it is extremely illogic or out of topic." Sometimes, it was a belief about themselves, as rater 14 in the researcher group stated, "I'm not familiar with these scientific questions and not sure if I rated right or not." Rater 5 in the researcher group did not give a 1 or a 5 to any responses to the earth task because he was one of those who "do not like to give extreme scores."

5.2.2.3. Raters' beliefs about creativity. The qualitative data also indicated that raters varied in their beliefs about creativity. These beliefs seemed to be deeply rooted in raters' minds and greatly influenced how they assessed creativity. Interestingly, raters held opposite opinions on the beliefs described below.

5.2.2.3.1. Creativity and knowledge. When talking about the relationship between creativity and knowledge, raters indicated two meanings with regard to creativity. One is being creative, which focuses on students' creativity or, more specifically, the process of how students come up with creative solutions. The other is assessing creativity, which refers to raters' procedure for assessing students' creativity.

5.2.2.3.1.1. Being creative and knowledge. For most raters, knowledge is the basis of being creative. Some raters believed that people need to know their field well before they can become creative. This is extremely important for creativity in science, where people first need to have an accurate understanding of how things work. The knowledge serves not only as a framework for creating new things but also as the cornerstone for people to explore the domain further. Another kind of knowledge is acquired from what students experience in their daily life. For some raters, students' experience is significant for being creative. They argued that students who have previous experience with the scenarios described in the two tasks would come up with more creative answers. Previous experience includes whether the students have solved the same problem in the past. For example, some students who have been to a survival camp know how to make fires and collect water on a deserted island. Experience also includes whether students had the chance to read about or see the same situation. Some raters noted that the water task reminded them of the Tom Hanks movie Cast Away, released in 2000. They commented that if the students had watched this movie, they would know many ways to survive on an island. Some also mentioned that students may not know anything about coconuts, which might negatively influence how they answered the water question. Rater 7 in the teacher group illustrated this point with an example of her own:

Say, I saw The Sound of Music before I had been to Europe, and then I went to Germany, and I saw The Sound of Music again, and I saw it with a whole different perspective. I've been to the Philippines, and I've seen coconut trees, but these kids had not, and they had no idea on how hard it was to break open a coconut.

Contrary to the raters who thought that knowledge is the basis of creativity, a few raters thought that creativity can be hindered by too much knowledge. This was because too much knowledge means too many restrictions. In the meantime, if students already knew which answer was correct, they would only write that correct answer instead of thinking deeper for a more novel idea. Rater 4 in the undergraduate group pointed out,

If a student has more knowledge about a certain topic then they might not be that creative. Instead they will try to give an answer that they think is correct and that the teacher is looking for. If they [students] have no knowledge on the topic then it might be more creative.

5.2.2.3.1.2. Assessing creativity and knowledge. Like being creative, assessing creativity also requires knowledge as the starting point. Raters talked about how they thought their knowledge, or lack of knowledge, of water and climate impacted their assessment of creativity. This implies that having a good understanding of the knowledge in a field is crucial for assessing creativity in it. Due to a lack of knowledge, raters felt insecure, as rater 3 in the teacher group indicated,

The second one [task] is harder [to assess] and I'm not feeling that I had a very good grasp of it in any way. I felt at times I was rating in the dark. If I would have been more knowledgeable about the scientific subject matter, I would have been more comfortable rating these.

A couple of judges maintained the opposite opinion. They thought that the topic knowledge they had might negatively influence their rating because they would focus more on correctness, instead of novelty, of the responses. Some raters mentioned that even though they had good background knowledge about water and climate, they tried to ignore it and just focused on creativity. Here they seemed to put knowledge and creativity on two opposite sides and believed that knowledge and creativity could not be combined.

5.2.2.3.2. Creativity in science and creativity in other fields. Raters also held opposite opinions on the differences between creativity in science and creativity in other fields. Some raters believed that creativity is different in various fields. It is more practical or factual in science than in other fields, especially when compared with writing and the arts, which can be completely imaginative. In addition, creativity in writing and the arts is akin to expressing people's feelings, but creativity in science needs to focus on solving real-life problems. Rater 9 in the teacher group elaborated on this point. She indicated,

There are two areas of creativity in making scientific models. In other words, trying to say this is why things happen and proposing these are the factors that are involved, this is how they interact. The other one is using what we know about how things work to solve other problems we have. And that is different creativity from in art or music or dance or in other things. That is a totally different kind of creativity and it has to be rooted in what really works.

A few raters also believed that there is no creativity in science because scientific thought involves lots of understanding and interpretation and there are only right or wrong responses, not creative or uncreative ones. Rater 13 in the researcher group pointed out, "it seems difficult to make scientific explanations creative—you can make presentations creative. . ., but explaining the tilt of the earth, well, there doesn't seem to be too much creativity about it, just understanding really."

Not all the raters believed that scientific creativity is different from creativity in other domains. This was especially true with regard to assessing creativity. Some raters stated that they would use criteria similar to those for assessing scientific creativity to evaluate creativity in other fields because their conception of creativity is the same across fields. Some mentioned that they would prefer to apply the same method to assess creativity in other fields.

Generally, raters who emphasized the importance of knowledge in deriving creative ideas and who believed that there is little room for creativity in science focused more on the appropriateness of the ideas, showed more reflection in their assessment, and adopted more stringent rating scales. In addition, raters who indicated uncertainty about their knowledge in assessing creativity and who believed that scientific creativity is similar to creativity in other fields focused more on the novelty of the ideas, showed less reflection in their assessment, and adopted less stringent rating scales.

6. Discussion

This study employed mixed methods to examine rater effects. Due to the different natures of the two aspects of rater effects, only research using both quantitative and qualitative methodologies can do justice to the complexity of the phenomenon. In this study, the two sets of data complement each other and form a complete picture of rater idiosyncrasies. More specifically, the quantitative data showed the existence of large inconsistencies in the ratings and among rater groups, while the qualitative data provided insights into the underlying reasons for these inconsistencies and variations. Furthermore, the two sets of data triangulate each other, which provides validity evidence for this study. For instance, based on the analysis of raw ratings, raters in the undergraduate group were more consistent in their rating. This was supported by the conclusions derived from the qualitative data, which suggested that the undergraduate rater group was engaged in fewer rating patterns than the other two groups and that more raters in that group used the complete rating scale.

The results generated from this study are consistent with previous findings. The analysis of the raw ratings in this study showed that the dependability of ratings assigned by human judges might be problematic if raters assess creativity based only on their own criteria or their own definition of creativity. Plucker and Makel (2010) observed that most reliability coefficients of creativity assessments that use rater judgment are only at an acceptable level (i.e., .70). Silvia et al. (2008) showed that, in using two subjective scoring methods to assess divergent thinking tasks, the variances associated with participant-rater interactions and other random errors in the divergent thinking tasks were between 24% and 41%, and there was more variance associated with raters than with performance for the consequences task. Hung et al. (2012) also found a rater-by-criterion interaction even after raters were trained.

The inconsistencies in ratings are also closely associated with the criterion issue present in assessing creativity. Recently, Long (2014) studied the rating criteria employed by raters in their creativity assessment. The research found that raters mainly used five criteria to assess creativity: appropriateness, novelty, thoughtfulness, interestingness, and cleverness. However, raters had different interpretations of each criterion. For example, appropriateness meant usefulness, fit, logic, correctness, or completeness of the solutions. When using novelty as one of the criteria, raters compared the uniqueness of the solutions to the sample, to themselves, or to the response forms. What's more, raters used different sets of criteria, chose different primary criteria, and applied them with different weights.

In addition, this study concluded that the novice rater group performed better in its assessment than the expert and quasi-expert groups. There is no shortage of studies in the literature showing that experts' ratings can lead to poor reliability. An extreme example is Hickey's (2001) study, which reported that the Cronbach's alpha of the ratings of three expert composers was only .04. Amabile (1982) also found that inter-judge agreement for nonexperts was even higher than for experts. Kaufman et al. (2009) concluded that novice judges yielded ratings that were as reliable as those of expert judges.
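As an illustration of the kind of inter-judge reliability index referred to above, the short sketch below computes Cronbach's alpha from a products-by-judges rating matrix, treating judges as "items." The data and the function name are hypothetical, and the snippet is not the analysis code used in this study or in the studies cited.

import numpy as np

def cronbach_alpha(ratings):
    # ratings: 2-D array with rows = rated products and columns = judges.
    ratings = np.asarray(ratings, dtype=float)
    n_judges = ratings.shape[1]
    judge_vars = ratings.var(axis=0, ddof=1)      # variance of each judge's ratings
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (n_judges / (n_judges - 1)) * (1 - judge_vars.sum() / total_var)

# Hypothetical example: six products rated by three judges on a 1-5 scale.
example = np.array([[3, 4, 2],
                    [5, 5, 4],
                    [2, 3, 1],
                    [4, 4, 5],
                    [1, 2, 2],
                    [3, 5, 3]])
print(round(cronbach_alpha(example), 2))

Values near zero, such as the .04 reported by Hickey (2001), indicate that the judges' rank orderings of the products barely overlap.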

This study also supported the idea that reliability coefficients may vary by task even for the same population. For instance, Baer (1994) found that long-term stability coefficients for story-writing and story-telling tasks differed for the same group of fourth graders. Kaufman et al. (2008) and Kaufman et al. (2009) asked the same two groups of judges to use the Consensual Assessment Technique to assess SciFaiku poems (2008) and short stories (2009), respectively, and obtained contrasting results for the correlation between expert and novice raters' ratings: a very low correlation (r = .22) in one study (2009) and a moderate correlation (r = .71) in the other (2008).
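Expert-novice correlations of the kind mentioned above are commonly obtained by averaging each group's ratings for every product and correlating the two sets of means. The brief sketch below, using hypothetical arrays, shows this computation; it is not drawn from the cited studies.

import numpy as np

def group_mean_correlation(group_a, group_b):
    # group_a, group_b: 2-D arrays, rows = products, columns = raters in that group.
    mean_a = np.asarray(group_a, dtype=float).mean(axis=1)  # product means, group A
    mean_b = np.asarray(group_b, dtype=float).mean(axis=1)  # product means, group B
    return np.corrcoef(mean_a, mean_b)[0, 1]                # Pearson r between the means

# Hypothetical use: r = group_mean_correlation(expert_ratings, novice_ratings)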

Furthermore, the steps of the rating procedure and the rater cognition identified in this study are similar to those found in the literature. The three steps of preparing, scoring, and adjusting in creativity assessment are identical to the three steps of assessing ESL writing (Lumley, 2002): reading, rating based on categories, and reevaluating the assigned scores. The three steps also resemble a model of rater cognition in assessing constructed responses developed by Bejar (2012) that includes assessment design and scoring phases. The first phase identifies the evidence for the judgment of the performance level and functions as the preamble to scoring, whereas the second phase is the actual process of assigning scores to the work. Additionally, strategies such as planning (Bejar, 2012; Crisp, 2012), scanning, scrutinizing (Suto & Greatorex, 2008), and emotional reaction (Crisp, 2012) are also employed in the rating process of creativity.

Findings in this study also corroborated previous conclusions regarding rater cognition. For instance, Bejar (2012) and Crisp (2008a, 2012) noted that raters often form their own mental rubric before assessment and apply it during assessment, whether or not specific rubrics are provided. Because raters vary in their backgrounds and prior knowledge, their mental rubrics differ. Grainger, Purnell, and Zipf (2008) also indicated that raters who take into account similar characteristics of the works and have a similar understanding of the works still assign different ratings to the same piece of work. According to Vaughan (1991), raters tend to weight the aspects that they personally believe are important or most important in their assessment. Because of this, even experienced raters do not apply rating rubrics uniformly to the works they assess.

Individuals' perceptions and implicit theories about human traits have a great impact on their understanding of human actions (Chan & Chan, 1999; Dweck, Chiu, & Hong, 1995). People's beliefs about creativity likewise influence how they view and evaluate it (Lim & Plucker, 2001; Sternberg, 1986). The relationship between creativity and knowledge is a controversial topic in the field. A number of researchers have emphasized the importance of knowledge in deriving creative ideas. For instance, knowledge is one of the six resources of creativity in Sternberg and Lubart's (1995, 1996) investment theory of creativity. It is also an initial requirement for being creative in the Amusement Park Theoretical (APT) model of creativity (Baer & Kaufman, 2005). The famous ten-year rule (Hayes, 1989; Weisberg, 1999) has been applied to many domains, and previous studies (e.g., Long, 2011; Weisberg, 1999) further supported the importance of experience to creativity. However, many psychologists (DeBono, 1968; Guilford, 1950; James, 1908; Koestler, 1964) maintained that too much knowledge is harmful to creativity. For example, DeBono (1968) contended, "Too much experience within a field may restrict creativity because you know so well how things should be done that you are unable to escape to come up with new ideas" (p. 228). Amabile (1989) and Hausman (1984) also argued that people cannot simply apply what they experienced in the past to new situations. One theory that reconciles these opposing views was developed by Simonton (1984), who contended that the relation between education and creativity is curvilinear, such that either too few years of education or too many years of training has a negative impact on people's creativity.

The rater effects identified in creativity assessment are also understandable when examined against the general assumptions of measurement. In fact, relying only on raters' individual judgment does not agree with those assumptions. When measuring a construct, it is assumed that the characteristic reflected in the construct is universal among individuals (Messick, 1989). However, when only individual judgment is involved in the measurement, it can hardly be determined what universal characteristic is implied in the construct, because the judgment not only varies with situations but also depends on characteristics of the trait being assessed. Another key assumption in measurement is the quantification of the construct; in other words, numbers are used to describe characteristics of individuals. When numbers are assigned to responses, it is assumed that all of the ratings are objective. When individual judgments are involved in the measurement, raters can assign a number or a rating to the products they are measuring; however, the meanings of these numbers are lost in the process (Smith & Heshusius, 1986). In addition, it is assumed in measurement that there is a domain and that the items drawn from the domain are representative (Messick, 1989). This is significant for generalizing the items to the universe of the characteristic. When a trait is measured by individual judgment, the domain is as hard to define as the trait itself, and what is representative of the domain is also debatable.

However, it has been extensively demonstrated in the literature that rater effects might be minimized if raters are provided with robust and appropriate training, including well-defined scoring rubrics and anchor works prior to assessment (Dunbar, Koretz, & Hoover, 1991; Lumley, 2002; Shavelson, Baxter, & Gao, 1993; Shohamy, Gordon, & Kraemer, 1992; Sudweeks et al., 2005; Weigle, 1991). Therefore, a well-designed training session is necessary to improve creativity assessment. Some early research (e.g., Dollinger & Shafran, 2005) has also demonstrated the effectiveness of training on creativity assessment. The idea of training raters seems to conflict with the assumptions of the Consensual Assessment Technique (CAT), the most popular approach to using human judges to assess creativity. According to Amabile (1982), because there is a definitional void in creativity assessment, we rely on appropriate judges who might not be highly creative but have at least some familiarity with the domain of interest. Judges are not provided with any criteria or training; instead, they use their own definition of creativity to assess creativity. At the same time, Amabile required that the tasks in the assessment should not be based on specialized skills but should be open-ended in nature. Obviously, the CAT approach avoids the difficulty of defining creativity, yet many questions remain open to debate. For instance, who are appropriate judges for assessing creativity: people who have expertise in creativity research, in domain knowledge, or in both areas? How much familiarity with the domain is sufficient? How do judges assess the creativity of tasks that involve domain-specific knowledge or skills? Whether novices or quasi-experts can be used as judges in assessing creativity is still inconclusive because of the mixed results in the literature (e.g., Kaufman et al., 2005, 2008, 2009). However, as noted by Kaufman et al. (2008), if we hope to use novices and quasi-experts as expert judges in order to decrease the cost of employing experts, we need to train them to be experts, not only in domain knowledge but also in creativity. Furthermore, when assessing something like scientific creativity, we do need judges who have expertise in both creativity and domain knowledge (even knowledge as simple as water condensation and weather change). Because it might be very difficult to find individuals who are experts in both fields, providing training to judges is still necessary in this sense.

This study has several limitations. It has a small sample of students and judges, although more than 2,000 ratings were generated. In addition, the judges, especially those in the researcher and teacher groups, had dissimilar backgrounds. For example, the educational researchers varied in their programs (i.e., master's vs. doctoral), research experience (e.g., a few years vs. less than 1 year), and major research interests (e.g., educational psychology vs. elementary education). The teachers also varied in age and teaching experience. Meanwhile, most of the judges were female, and they were not randomly selected, so they might not be representative of the entire population. These factors might affect the results of the study to some degree. In the future, a few interesting questions warrant further investigation. For instance, how do raters' demographic backgrounds, such as gender and previous experience, affect their creativity assessment? What role does raters' domain knowledge play in assessing creativity in science? What are the differences, in terms of ratings and rater cognition, between analytic and holistic rating approaches in creativity assessment?

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: AERA.
Amabile, T. M. (1982). Social psychology of creativity: A consensual assessment technique. Journal of Personality and Social Psychology, 43, 997–1013. http://dx.doi.org/10.1037/0022-3514.43.5.997
Amabile, T. M. (1989). Growing up creative: Nurturing a lifetime of creativity. New York, NY: Crown.
Baer, J. (1994). Performance assessments of creativity: Do they have long-term stability? Roeper Review, 17, 7–12. http://dx.doi.org/10.1080/02783199409553609
Baer, J., & Kaufman, J. C. (2005). Bridging generality and specificity: The Amusement Park Theoretical (APT) model of creativity. Roeper Review, 27, 158–163. http://dx.doi.org/10.1080/02783190509554310
Bejar, I. I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2–9. http://dx.doi.org/10.1111/j.1745-3992.2012.00238.x
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. http://dx.doi.org/10.1037/h0046016
Chan, D. W., & Chan, L.-K. (1999). Implicit theories of creativity: Teachers' perception of student characteristics in Hong Kong. Creativity Research Journal, 12, 185–195. http://dx.doi.org/10.1207/s15326934crj1203_3
Charney, D. (1984). The validity of using holistic scoring to evaluate writing: A critical overview. Research in the Teaching of English, 18, 65–81. http://www.jstor.org/stable/40170979
Cohen, L., & Manion, L. (1994). Research methods in education. New York, NY: Routledge.
Creswell, J. W. (2014). Research design: Qualitative, quantitative, and mixed methods approaches (4th ed.). Los Angeles, CA: SAGE.
Crisp, V. (2008a). Exploring the nature of examiner thinking during the process of examination marking. Cambridge Journal of Education, 38, 247–264. http://dx.doi.org/10.1080/03057640802063486
Crisp, V. (2008b). The validity of using verbal protocol analysis to investigate the processes involved in examination marking. Research in Education, 79, 1–12.
Crisp, V. (2012). An investigation of rater cognition in the assessment of projects. Educational Measurement: Issues and Practice, 31(3), 10–20.
Cronbach, L. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391–418. http://dx.doi.org/10.1177/0013164404266386
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16, 137–163. http://dx.doi.org/10.1111/j.2044-8317.1963.tb00206.x
Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7, 31–51. http://dx.doi.org/10.1177/026553229000700104
Cumming, A., Kantor, R., & Powers, D. E. (2001). Scoring TOEFL essays and TOEFL 2000 prototype writing tasks: An investigation into raters' decision making and development of a preliminary analytic framework (Research Bulletin No. RM-01-04). Princeton, NJ: Educational Testing Service.
DeBono, E. (1968). New think: The use of lateral thinking in the generation of new ideas. New York, NY: Basic Books.
Dollinger, S. J., & Shafran, M. (2005). Notes on consensual assessment technique in creativity research. Perceptual and Motor Skills, 100, 592–598.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4, 289–303. http://dx.doi.org/10.1207/s15324818ame0404_3
Dweck, C. S., Chiu, C.-Y., & Hong, Y.-Y. (1995). Implicit theories and their role in judgments and reactions: A word from two perspectives. Psychological Inquiry, 6, 267–285. http://dx.doi.org/10.1207/s15327965pli0604_1
Grainger, P., Purnell, K., & Zipf, R. (2008). Judging quality through substantive conversations between markers. Assessment and Evaluation in Higher Education, 33, 133–142. http://dx.doi.org/10.1080/02602930601125681
Guilford, J. P. (1936). Psychometric methods. New York, NY: McGraw-Hill.
Guilford, J. P. (1950). Creativity. American Psychologist, 5, 444–454.
Hamp-Lyons, L. (2007). Worrying about rating [Editorial]. Assessing Writing, 12, 1–9. http://dx.doi.org/10.1016/j.asw.2007.05.002
Hausman, C. (1984). Discourse on novelty and creation. Albany, NY: State University of New York Press.
Hayes, J. R. (1989). Cognitive processes in creativity. In J. A. Glover, R. R. Ronning, & C. R. Reynolds (Eds.), Handbook of creativity (pp. 135–145). New York, NY: Plenum.
Hickey, M. (2001). An application of Amabile's consensual assessment technique for rating the creativity of children's musical composition. Journal of Research in Music Education, 49, 234–244. http://dx.doi.org/10.2307/3345709
Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: Teacher observation systems and a case for the generalizability study. Educational Researcher, 41, 56–64. http://dx.doi.org/10.3102/0013189X12437203
Hung, S.-P., Chen, P.-H., & Chen, H.-C. (2012). Improving creativity performance assessment: A rater effect examination with many facet Rasch model. Creativity Research Journal, 24, 345–357. http://dx.doi.org/10.1080/10400419.2012.730331
Huot, B. A. (1993). The influence of holistic scoring procedures on reading and rating student essays. In M. M. Williamson & B. A. Huot (Eds.), Validating holistic scoring for writing assessment (pp. 206–236). Cresskill, NJ: Hampton Press.
Iramaneerat, C., Yudkowsky, R., Myford, C. M., & Downing, S. M. (2008). Quality control of an OSCE using generalizability theory and many-faceted Rasch measurement. Advances in Health Science Education: Theory and Practice, 13, 479–493. http://dx.doi.org/10.1007/s10459-007-9060-8
James, W. (1908). Talks to teachers on psychology. New York, NY: Henry Holt.
Johnson, R. B., Onwuegbuzie, A. J., & Turner, L. A. (2007). Toward a definition of mixed methods research. Journal of Mixed Methods Research, 1, 112–133. http://dx.doi.org/10.1177/1558689806298224
Johnson-Laird, P. N. (1983). Mental models: Towards a cognitive science of language, inference, and consciousness. Cambridge, MA: Harvard University Press.
Kahneman, D., & Frederick, S. (2002). Representativeness revisited: Attribute substitution in intuitive judgment. In T. Gilovich, D. Griffin, & D. Kahneman (Eds.), Heuristics and biases: The psychology of intuitive judgment (pp. 49–81). Cambridge, UK: Cambridge University Press.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
Kaufman, J. C., & Baer, J. (2012). Beyond new and appropriate: Who decides what is creative? Creativity Research Journal, 24, 83–91. http://dx.doi.org/10.1080/10400419.2012.649237
Kaufman, J. C., Baer, J., & Cole, J. C. (2009). Expertise, domains, and the consensual assessment technique. Journal of Creative Behavior, 43, 223–233. http://dx.doi.org/10.1002/j.2162-6057.2009.tb01316.x
Kaufman, J. C., Baer, J., Cole, J. C., & Sexton, J. D. (2008). A comparison of expert and nonexpert judges using the consensual assessment technique. Creativity Research Journal, 20, 171–178. http://dx.doi.org/10.1080/10400410802059929
Kaufman, J. C., Baer, J., & Gentile, C. A. (2004). Differences in gender and ethnicity as measured by ratings of three writing tasks. Journal of Creative Behavior, 39, 56–69. http://dx.doi.org/10.1002/j.2162-6057.2004.tb01231.x
Kaufman, J. C., Gentile, C. A., & Baer, J. (2005). Do gifted student writers and creative writing experts rate creativity the same way? Gifted Child Quarterly, 49, 260–265. http://dx.doi.org/10.1177/001698620504900307
Koestler, A. (1964). The act of creation. London: Macmillan.
Laming, D. (2004). Human judgment: The eye of the beholder. London, UK: Thomson Learning.
Lim, W., & Plucker, J. A. (2001). Creativity through a lens of social responsibility: Implicit theories of creativity with Korean samples. Journal of Creative Behavior, 35, 115–130. http://dx.doi.org/10.1002/j.2162-6057.2001.tb01225.x
Linacre, J. M. (1996). Generalizability theory and many-facet Rasch measurement. In G. Engelhard Jr. & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 85–98). Norwood, NJ: Ablex.
Long, H. (2011). Activities before idea generation in creative process: What do people do to catch their Muse? International Journal of Creativity and Problem Solving, 21(2), 39–56.
Long, H. (2014). More than appropriateness and novelty: Judges' criteria of assessing creative products in science tasks. Thinking Skills and Creativity, 13, 183–194. http://dx.doi.org/10.1016/j.tsc.2014.05.002
Long, H. (in press). An empirical review of research methodologies and methods in creativity studies (2003–2012). Creativity Research Journal.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19, 246–276. http://dx.doi.org/10.1191/0265532202lt230oa
Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15, 158–180. http://dx.doi.org/10.1177/026553229801500202
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13–103). New York, NY: Macmillan.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749. http://dx.doi.org/10.1037/0003-066X.50.9.741
Myford, C. M. (2012). Rater cognition research: Some possible directions for the future. Educational Measurement: Issues and Practice, 31, 48–49. http://dx.doi.org/10.1111/j.1745-3992.2012.00243.x
Noyes, E. S. (1963). Essay and objective tests in English. College Board Review, 49, 7–10.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill.
Plucker, J. A., & Makel, M. C. (2010). Assessment of creativity. In J. C. Kaufman & R. J. Sternberg (Eds.), The Cambridge handbook of creativity (pp. 48–73). New York, NY: Cambridge University Press.
Runco, M. A., McCarthy, K. A., & Svenson, E. (1994). Judgments of the creativity of artwork from students and professional artists. Journal of Psychology, 128, 23–31. http://dx.doi.org/10.1080/00223980.1994.9712708
Sadler, D. R. (1989). Formative assessment and the design of instructional systems. Instructional Science, 18, 119–144.
Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30, 215–232. http://dx.doi.org/10.1111/j.1745-3984.1993.tb00424.x
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. London: SAGE.
Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effect of raters' background and training on the reliability of direct writing tests. Modern Language Journal, 76, 27–33. http://dx.doi.org/10.1111/j.1540-4781.1992.tb02574.x
Silvia, P. J., Winterstein, B. P., Willse, J. T., Barona, C. M., Cram, J. T., Hess, K. I., et al. (2008). Assessing creativity with divergent thinking tasks: Exploring the reliability and validity of new subjective scoring methods. Psychology of Aesthetics, Creativity, and the Arts, 2, 68–85. http://dx.doi.org/10.1037/1931-3896.2.2.68
Simonton, D. K. (1984). Genius, creativity, and leadership. Cambridge: Cambridge University Press.
Smith, E. V., Jr., & Kulikowich, J. M. (2004). An application of generalizability theory and many-facet Rasch measurement using a complex problem-solving skills assessment. Educational and Psychological Measurement, 64, 617–639. http://dx.doi.org/10.1177/0013164404263876
Smith, J. K., & Heshusius, L. (1986). Closing down the conversation: The end of the quantitative–qualitative debate among educational inquirers. Educational Researcher, 15, 4–12. http://dx.doi.org/10.3102/0013189X015001004
Stanovich, K., & West, R. (2002). Individual differences in reasoning: Implications for the rationality debate? In T. Gilovich, D. Griffin, & D. Kahneman (Eds.), Heuristics and biases: The psychology of intuitive judgment (pp. 421–440). Cambridge, UK: Cambridge University Press.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). Retrieved July 2014, from http://PAREonline.net/getvn.asp?v=9&n=4
Sternberg, R. J. (1986). Intelligence, wisdom, and creativity: Three is better than one. Educational Psychologist, 21, 175–190. http://dx.doi.org/10.1207/s15326985ep2103_2
Sternberg, R. J., & Lubart, T. I. (1995). Defying the crowd: Cultivating creativity in a culture of conformity. New York, NY: Free Press.
Sternberg, R. J., & Lubart, T. I. (1996). Investing in creativity. American Psychologist, 51, 677–688. http://dx.doi.org/10.1037/0003-066X.51.7.677
Sudweeks, R., Reeve, S., & Bradshaw, W. S. (2005). A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing. Assessing Writing, 9, 239–261. http://dx.doi.org/10.1016/j.asw.2004.11.001
Suto, W. M. I., & Greatorex, J. (2008). What goes through an examiner's mind? Using verbal protocols to gain insights into the GCSE marking process. British Educational Research Journal, 34, 213–233. http://dx.doi.org/10.1080/01411920701492050
Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 111–125). Norwood, NJ: Ablex.
Weigle, S. C. (1991). Effects of training on raters of ESL compositions. Language Testing, 11, 197–223. http://dx.doi.org/10.1177/026553229401100206
Weigle, S. C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative approaches. Assessing Writing, 6, 145–178. http://dx.doi.org/10.1016/S1075-2935(00)00010-6
Weisberg, R. W. (1999). Creativity and knowledge: A challenge to theories. In R. J. Sternberg (Ed.), Handbook of creativity (pp. 226–250). Cambridge: Cambridge University Press.
Wolfe, E. W. (2004). Identifying rater effects using latent trait models. Psychology Science, 46, 35–51.
Wolfe, E. W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31, 31–37. http://dx.doi.org/10.1111/j.1745-3992.2012.00241.x