East Coast Organization of Language Testers
Conference 2011
Friday, October 28th
Saturday, October 29th
Georgetown University
Washington, DC, USA
Validity:
Opportunities & Challenges
National Capital Language Resource Center
Center for Applied Linguistics
Second Language Testing, Inc.
Georgetown University
ECOLT Conference Committee
Tim Farnsworth, Program Chair, Hunter College, the City University of New York
Margaret E. (Meg) Malone, Conference Co-chair, Center for Applied Linguistics
Francesca Di Silvio, Conference Co-chair, Center for Applied Linguistics
ECOLT Abstract Reviewers
Rachel Brooks, Federal Bureau of Investigation
Jee Wha Dakin, Oxford University Press
Talia Isaacs, University of Bristol
Lorena Llosa, New York University
Margaret Malone, Center for Applied Linguistics
Maria McCormack, Teachers College, Columbia University
Christine Rosalia, CUNY Hunter College
Sun-Young Shin, Indiana University at Bloomington
Elvis Wagner, Temple University
The Conference Committee would like to thank the following
individuals and organizations for their support of ECOLT 2011:
Dorry Kenyon, Anne Donovan, and Aileen Bach, Center for Applied Linguistics
Mackenzie Price and Colleen Moorman, Georgetown University
Scott McGinnis, Defense Language Institute
Charles Stansfield, Second Language Testing, Inc.
National Capital Language Resource Center
Second Language Testing, Inc.
The Testing Committee of the Interagency Language Roundtable
Special thanks to Jeff Connor-Linton, Georgetown University, and Anup Mahajan,
National Capital Language Resource Center
Agenda
Friday, October 28, 2011
All paper presentations will be given in the Intercultural Center (ICC) Auditorium, 2nd floor. Please note
that the main entrance to the ICC is on the 3rd floor.
Refreshments will be located in the foyer of the auditorium.
Poster presentations will be located in ICC room 107, 1st floor.
8:30-9:30 Registration
9:30-9:45 Introductions
9:45-10:45 ILR Panel
10:45-11:00 Break
11:00-12:00 Plenary Presentation
Validity Arguments for Language Assessment: Opportunities and Challenges
Carol A. Chapelle
12:00-12:30 Paper Session 1 (1 paper)
The Assessment of Reading Comprehension Skills for Immigrants: The Case of the
Netherlands
Ryan Downey, Jo Fond Lam, Alistair Van Moere
12:30-1:30 Lunch
1:30-3:00 Paper Session 2 (3 papers)
Relationships between Linguistic Complexity Measures and Student Scores on a 6th
Grade Science Assessment
Aubrey Logan-Terry, Timothy Farnsworth
Validating an Assessment Framework of Linguistic Knowledge for Teaching Math and
Science to English Language Learners
Sultan Turkan, Jerry Bicknell
Assessing ELL Content in the Mainstream Classroom: Teacher Decision-making
Processes
Beth Clark-Gareca
3:00-3:15 Break
3:15-4:15 Paper Session 3 (2 papers)
Differential Item Functioning in a High Stakes Test
Mohammad Salehi, Alireza Tayebi
Evaluating Oral Collocational Production to Predict L2 Oral Proficiency
Jing Xu
4:15-4:30 Break/Poster Set Up
4:30-6:00 Poster Session (8 posters)
Investigating Assessment Literacy through Teacher Professional Development
Aileen Bach, Anne Donovan
The Design of an Oral Test for Foreign Language Teachers: Validating the Vocabulary
Descriptors
Melissa Alves Baffi-Bonvino
Withdrawing from the Bank: Item Stability across Multiple Testing Contexts
Martyn Clark
Bridging the Gap: How Language Testers Can Build Assessment Literacy in
Practitioners of Other Fields
Anne Donovan, Margaret E. Malone, Francesca Di Silvio, Megan Montee
Transcription as a Language Testing Tool
Beth Mackey
Task Complexity Features and Speaking Test Performance
Megan Montee
Best Practices in Pilot Testing a High-stakes English Language Proficiency Test
Abbe Spokane, Tiffany Yanosky
Implementing Cognitive Diagnostic Assessment in an Institutional Test through
Collaboration of Language Testers: A New Networking Model in Language Testing
Yeon-Sook Yi, Stephanie Gaillard
Saturday, October 29, 2011
8:30-9:30 Registration
9:30-11:00 Paper Session 1 (3 papers)
Assessing Lower-order and Higher-order Listening Skills for ESL Students
Tatiana Nekrasova-Becker, Anthony Becker
Building Evidence for the Evaluation of English Learners’ Writing Scores
Anthony Becker
Textual Borrowing and Rater Perceptions in Integrated Writing Tasks
Sara Cushing Weigle, Megan Montee
11:00-11:15 Break
11:15-12:45 Paper Session 2 (3 papers)
The Development of Specifications for an Oral Proficiency Test: Contributions to
Validity Claims
Francesca Di Silvio, Anne Donovan, Beth Mackey
Assessing Learning Outcomes in Short-term Foreign Language Programs: Validation
Results of a Triangulated Assessment System
Megan C. Masters, Steven J. Ross, Margaret E. Malone
Investigating the Construct Validity of the Grammar Sub-Test of the CEP Placement
Exam
Payman Vafaee, Nesrine Basheer, Reese Heitner
Paper abstracts listed in presentation order
ILR Panel
Standard Setting in the Department of Defense (DoD)
The ILR Testing Committee
The ILR Testing Committee will host a panel discussion on standard setting. Standard setting is a well-
established method in educational testing for having experts determine what kind of performance on a
test demonstrates a given level of competence. Determining how to use standard-setting
recommendations is a key issue in building a validity argument.
Representatives from several agencies will discuss their roles in the standard setting process, moving
through the chronology of a standard setting workshop. The process as implemented by the Defense
Language Institute Foreign Language Center (DLIFLC) begins with the workshop participants, who
represent a large cross-section of stakeholders. Participants are trained in general Borderline Proficiency
Level Descriptions (BPLDs), which are statements derived from the ILR Skill Level Descriptions to
capture the minimum performance expectations for each level. Standard-setting participants
independently make a judgment for each item on the standard-setting test form about whether examinees
at each borderline would or would not know the answer to that item. Statisticians at the DLIFLC review
the results upon completion of the panel; information is compiled and cut scores are transformed onto
the Item Response Theory theta scale.
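For illustration only, the following Python sketch shows the arithmetic behind a yes/no Angoff-style aggregation like the one described above: panel judgments are averaged into a raw cut score, which is then mapped onto the theta scale by inverting a test characteristic curve. All data, parameter values, and the 2PL model choice are hypothetical assumptions, not details of the DLIFLC procedure.

import numpy as np

# Hypothetical yes/no judgments: judgments[p, i] = 1 if panelist p believes a
# borderline examinee at a given level would answer item i correctly.
rng = np.random.default_rng(0)
judgments = rng.integers(0, 2, size=(12, 60))   # 12 panelists, 60 items

# Each panelist's implied raw cut is the number of items a borderline
# examinee is expected to answer correctly; the panel cut is the mean.
raw_cut = judgments.sum(axis=1).mean()

def tcc(theta, a, b):
    """Expected raw score as a function of theta under a 2PL model."""
    return (1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))).sum(axis=1)

a = rng.uniform(0.5, 2.0, 60)   # hypothetical discrimination parameters
b = rng.normal(0.0, 1.0, 60)    # hypothetical difficulty parameters
grid = np.linspace(-4, 4, 801)
theta_cut = grid[np.argmin(np.abs(tcc(grid, a, b) - raw_cut))]
print(f"raw cut = {raw_cut:.1f}; theta cut = {theta_cut:.2f}")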
Panelists will include stakeholders, standard-setting participants, standard-setting planners/organizers,
Defense Language Proficiency Testing program management, and statisticians. The goal of the ECOLT
discussion is to provide a forum for discussion of a real-world standard setting activity, drawing upon
various perspectives. Because of the high-stakes nature of the Defense Language Proficiency Testing
System, constant improvement to testing is a critical component in ensuring test validity. We expect that
this topic will appeal to the ECOLT audience as it represents the intersection of theory and practice.
Notes
Plenary
Validity Arguments for Language Assessment: Opportunities and Challenges
Carol A. Chapelle
Iowa State University
Approaches to validity arguments making their way into language assessment offer some important
advances. Kane (2006) and others working in this area provide basic conceptual tools which are aimed,
in part, at challenging test developers and users to make explicit the intended interpretations and uses of
test results in addition to the evidence and rationales that support such interpretations and uses. The
overall goal of reaching a well-justified conclusion (about test interpretation and use) is similar to past
practices in test validation; what is different is the specific conceptual infrastructure offered for doing so
(Chapelle, Enright, & Jamieson, 2010).
I will suggest that such tools for developing validity arguments are important for language testing
because of their role in prompting clear and detailed interpretive arguments, which point to research
needed to support the argument’s claims. At the same time, the opportunity to include such delicate
levels of specification, evidence and rationales raises challenges which may help to push applied
linguistics farther.
I will provide examples of some of the opportunity/challenge points that language testers face as they
use the tools of current validity argument approaches. One is the potential need that arises for an
inference in the argument to warrant the sampling of specific linguistic features in test development. A
second is the need for substantial evidence supporting scoring rules on constructed response tests. A
third is the need to converge interpretive arguments from multiple assessment procedures to warrant a
decision. These issues are not necessarily unique to language assessment, but here they become apparent
because of the detail with which language ability can be defined and the need to specify the relevant
domain of language in most interpretive arguments.
References
Chapelle, C. A., Enright, M. E., & Jamieson, J. (2010). Does an argument-based approach to
validity make a difference? Educational Measurement: Issues and Practice, 29(1), 3-13.
Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational Measurement (4th ed., pp. 17-64).
Westport, CT: Greenwood Publishing.
Notes
Paper
The Assessment of Reading Comprehension Skills for Immigrants:
The Case of the Netherlands
Ryan Downey
Pearson Knowledge Technologies
Jo Fond Lam
CINOP
Alistair Van Moere
Pearson Knowledge Technologies
The use of language assessments for immigration screening has wide-ranging implications for validity
and score use. This paper analyses the case of the Netherlands, where candidates for immigration must
currently pass three tests which are administered in embassies worldwide: an oral test of knowledge
about Dutch culture, a test of oral proficiency in Dutch language, and a newly-implemented test of basic
literacy and reading comprehension in Dutch. All three tests are automated and delivered by telephone;
the candidate hears recorded test questions or prompts over the telephone, and the candidate’s spoken
responses are scored using speech processing technology.
The new test, the "Toets Geletterdheid en Begrijpend Lezen" ("Test of Literacy and Reading
Comprehension"), was developed under the assumption that social integration will be more successful if
immigrants have basic (CEFR A1 level) reading comprehension skills in Dutch. In some sections of the
test the candidate is instructed to read aloud from a test paper and in other sections the candidate must
read silently and then give spoken answers to written comprehension questions.
There were many challenges to constructing a validity argument for the Dutch reading test. This
presentation discusses the development of the test in accordance with the policy requirements of the
government of the Netherlands, and then describes several studies aimed at providing valid
interpretations of test scores. Important questions pertaining to validity are raised and answered,
including: What are the challenges to the validity of a reading test which must elicit oral responses to
comprehension questions, and how are they overcome? Why was the construct of "literacy" defined by
the government and how is it assessed? Can the machine scoring system disassociate construct-
irrelevant abilities, such as pronunciation? And finally, where is the line drawn between ethics of
language test construction and policy implementation?
Notes
Paper
Relationships between Linguistic Complexity Measures and Student Scores
on a 6th Grade Science Assessment
Aubrey Logan-Terry
Georgetown University
Timothy Farnsworth
Hunter College, City University of New York
The validity of English language content tests for English language learners (ELLs) has recently been an
area of intense research interest, because researchers have argued that language serves as a source of
construct-irrelevant variance in these tests (e.g. Abedi, 2006; Abedi, Courtney, & Leon, 2001; Abedi &
Lord, 2001; Farnsworth, 2006; Martiniello, 2009; Staley, 2005; Wolf & Leon, 2009); however, other
recent research suggests that linguistic aspects of test items may be construct-relevant (Bailey, 2005;
Farnsworth, 2008) and differences in performance across groups may be due to factors other than
language barriers in the tests (Bachman & Koenig, 2004; Elliot, 2008; Ockey, 2007). The present study
employs a computerized readability tool (Coh-Metrix) and Multilevel Modeling (MLM) to investigate
whether linguistic complexity of test items is predictive of students' item-level scores, and whether the
relationships vary between ELLs and non-ELLs, thereby shedding light on possible language barriers to
performance in the tests. Preliminary findings from analysis of a corpus of 6th grade classroom science
tests (n = 852 students) indicate that linguistic complexity of test items is an important predictor of all
students' scores, but not always in the expected direction. Results also provide some limited evidence
that constructed response formatting and increasing syntactic complexity of items disadvantage ELLs.
The study has additional significance for the field as it highlights the promise of extending a similar
Coh-Metrix/MLM methodology to analysis of large-scale standardized content tests for ELLs.
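As a minimal sketch of the kind of item-level multilevel model described above, the following Python code fits a logistic model with crossed random intercepts for students and items; the file name, column names, choice of Coh-Metrix indices, and the use of statsmodels' variational Bayes mixed GLM (rather than the authors' actual MLM software) are all assumptions.

import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Hypothetical long-format data: one row per student-item response, with a
# binary score, ELL status, and item-level Coh-Metrix complexity indices.
df = pd.read_csv("science_item_responses.csv")

model = BinomialBayesMixedGLM.from_formula(
    "correct ~ syntactic_complexity * ell + word_frequency * ell",
    vc_formulas={"student": "0 + C(student_id)", "item": "0 + C(item_id)"},
    data=df,
)
result = model.fit_vb()   # variational Bayes estimation
print(result.summary())  # ELL interactions flag possible language barriers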
Notes
Paper
Validating an Assessment Framework of Linguistic Knowledge for Teaching
Math and Science to English Language Learners
Sultan Turkan
Educational Testing Service
Jerry Bicknell
Educational Testing Service
One out of every five content teachers faces the challenge of teaching content to English Language
Learners (ELLs) in comprehensible and accessible ways. Their challenge is closely associated with the
degree to which they are prepared to scaffold ELLs' academic achievement in the U.S. Preparing teachers
to face the challenge could be systematized by defining the essential knowledge base that teachers of
ELLs should be held accountable for when entering the profession. This knowledge base needs
solidifying through empirical and evidence-based validity arguments.
The purpose of this study was to validate an assessment framework of linguistic knowledge teachers
need to teach math and science to English-language Learners (ELLs). To this end, the researchers
conducted a literature review, a national survey of teachers and teacher educators from districts with
high ELL populations, and convened a panel of expert teachers, researchers, and teacher educators. The
survey items were rated by a national sample of 358 teachers and teacher educators on a 5-point scale
ranging from 1 (not at all important) to 5 (extremely important). As a result of these efforts, two primary
domains of knowledge emerged in the framework: knowledge of academic language and knowledge of
how to make content accessible to ELLs. Presenters unpack each domain, exemplifying items that were
administered as part of a pilot test to middle school math and science teachers working in high-ELL
school districts. Presenters discuss how a small sample of teacher test takers received the test items as
part of the cognitive interviews. Participants leave the session with an understanding of the challenges
and promises of designing assessment items of linguistic knowledge required for teaching math and
science to ELLs. This research is significant as it sits at the intersection of how academic language is
defined and taught for content instruction in the U.S. context.
Notes
Paper
Assessing ELL Content in the Mainstream Classroom:
Teacher Decision-Making Processes
Beth Clark-Gareca
New York University
Steinhardt School of Culture, Education and Human Development
Content tests for English Language Learners (ELLs) are an increasingly valuable component in
students’ scholastic portfolio and provide critical information upon which academic tracking, promotion,
and remediation decisions are made. Though ELL content assessment has been investigated primarily
through test accommodations on high-stakes tests, little is known to date about how ELLs take
content tests in mainstream classrooms; this despite many state departments of education (e.g.,
Pennsylvania, Texas, North Carolina, and Florida) relying on classroom accommodations practices to
set a precedent for standardized testing protocols. This descriptive study contributes to the nascent body
of knowledge relating to teacher practices when evaluating ELLs’ academic performance in content
areas. Data collection consisted of three classroom observations in ten 4th grade classrooms; initial
observations were conducted of routine math and/or science instruction followed by two observations of
math and/or science tests. Observations were followed by teacher interviews designed to explore their
decision-making processes during classroom assessment as well as when evaluating ELL student work.
Inductive coding was used to identify themes in teacher responses relating to decision-making in
assessment, accommodations, and grading practices. Findings suggested that teachers have great
autonomy in their assessment practices, with accommodations implementation varying widely
depending on teacher evaluation of student needs. Teacher perception of student language proficiency
proved a consistent criterion upon which decisions were based, though certain high-stakes
accommodations such as providing bilingual/translated tests or dictionary use were not observed during
any content test administrations. Accommodated grading practices were common, primarily through
systems which weighted student participation and effort above academic achievement. Of note was that
in a majority of classrooms, accommodations implementation did not differ between ELLs and students
with special needs; a fact which calls into question overall teacher understanding of language acquisition
processes.
Notes
Paper
Differential Item Functioning in a High Stakes Test
Mohammad Salehi
Sharif University of Technology, Tehran, Iran
Alireza Tayebi
Sharif University of Technology, Tehran, Iran
Validation is an important enterprise, especially when the test to be validated is a high-stakes one.
Messick’s notion of construct irrelevant factors is pertinent in test validation. Demographic variables
like gender, field of study and age can affect test results and interpretations. A fair test needs to be
neutral when it comes to construct irrelevant factors such as gender. Differential item functioning is a
way of making sure that the test does not favor one group of test takers over the others. The current
study investigated differential item functioning (DIF) in terms of gender in the reading comprehension
subtest of a high-stakes test using a three-step logistic regression procedure (Zumbo, 1999). The test
comprises three sections, grammar, vocabulary, and reading comprehension, totaling 100 items, of which
the last 35 (reading comprehension) were investigated. The participants of the
study were 3,398 test takers, both males and females, who took the test in question (the UTEPT) as a
partial requirement for entering a PhD program at the University of Tehran. In order to show whether
the 35 items of reading comprehension exhibited DIF or not, logistic regression using a three-step
procedure (Zumbo, 1999) was employed. Three sets of criteria were selected, namely Cohen's (1988),
Zumbo's (1999), and Jodoin and Gierl's (2001). It was revealed that, though the 35 items of the reading
section show "small" effect sizes according to Cohen's classification, they do not display DIF based on
the other two criteria. Therefore, it can be concluded that the reading comprehension subtest of the
UTEPT favors neither males nor females.
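A schematic version of the three-step logistic regression procedure is sketched below, assuming hypothetical file and column names; note that Zumbo's (1999) own effect-size measure is a weighted least squares R-squared, for which McFadden's pseudo-R-squared is substituted here, and the 0.035 flag is the Jodoin and Gierl (2001) threshold.

import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def dif_logistic(df, item, total="total_score", group="gender"):
    """Compare the conditioning-only model with the model adding group and
    group-by-total terms; the 2-df test covers uniform and nonuniform DIF."""
    m1 = smf.logit(f"{item} ~ {total}", data=df).fit(disp=0)
    m3 = smf.logit(f"{item} ~ {total} * {group}", data=df).fit(disp=0)
    lr = 2 * (m3.llf - m1.llf)
    p = stats.chi2.sf(lr, df=2)
    delta_r2 = m3.prsquared - m1.prsquared   # McFadden stand-in effect size
    return lr, p, delta_r2

df = pd.read_csv("utept_responses.csv")      # hypothetical wide-format data
for i in range(66, 101):                     # the last 35 (reading) items
    lr, p, dr2 = dif_logistic(df, f"item_{i}")
    if p < 0.05 and dr2 >= 0.035:
        print(f"item_{i}: possible DIF (LR={lr:.1f}, dR2={dr2:.3f})")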
Notes
Paper
Evaluating Oral Collocational Production to Predict L2 Oral Proficiency
Jing Xu
Iowa State University
State-of-the-art automated speech evaluation (ASE) systems such as SpeechRater and Versant use a
subset of the criteria evaluated by human raters to predict human scores (Weigle, 2010; Xi, 2010).
However, for more accurate score prediction, the construct coverage of these systems needs to be
expanded to include additional speech features that are construct-relevant and measurable (Xi et al.,
2008). Collocations play a scaffolding role in building up oral language but these formulaic expressions
have been found to pose serious problems for second language (L2) learners (e.g., Ellis, 2008). Many
researchers have argued that L2 learners’ collocation use in spontaneous speech is a good indicator of
their oral proficiency (e.g., Handl, 2008). Hence, oral collocation production, if measured appropriately,
might be useful in improving ASE systems’ predicting power. The present study explores several ways
of evaluating collocations produced by L2 speakers and examines the extent to which appropriate
collocation use contributes to oral proficiency. Twenty Chinese learners of English representative of
four oral proficiency levels were randomly selected from a spoken corpus comprising speech samples of
an institutional oral English test. From the transcriptions of these speech samples, collocation strings of
10 syntactic patterns were manually extracted. These strings were then coded by six trained native-
English-speaking linguists for accuracy, semantic transparency, and frequency and by two non-native
linguists for difficulty. Based on this human coding, collocations produced by test takers of different
oral proficiency levels were compared. Further, collocation features deriving from the human coding
(e.g., difficult collocations attempted per 100 words, ratio of accurate collocations) were included in a
regression model to predict the speakers’ holistic test scores given by human raters according to a
comprehensive rating rubric covering pronunciation, vocabulary, and fluency.
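The final step described above can be sketched as an ordinary regression of human holistic scores on collocation-derived predictors; the feature and file names are hypothetical stand-ins for the features mentioned in the abstract.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-speaker feature table derived from the human coding.
feats = pd.read_csv("collocation_features.csv")

model = smf.ols(
    "holistic_score ~ difficult_colloc_per_100w + accurate_colloc_ratio",
    data=feats,
).fit()
print(model.summary())   # R-squared indicates the features' predictive power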
Notes
Paper
Assessing Lower-order and Higher-order Listening Skills for ESL Students
Tatiana Nekrasova-Becker
Second Language Testing, Inc.
Anthony Becker
Second Language Testing, Inc.
Designing second-language (L2) tests to measure students' listening comprehension can be quite
challenging, as L2 teachers and/or test developers must consider the assessment of various lower-order
(e.g., identifying main ideas) and higher-order skills (e.g., inferencing or interpreting speakers’
attitudes). Oftentimes, L2 listening tests are limited to measuring lower-order skills, as these are
generally easier and more familiar to assess. However, neglecting higher-order skills, which are essential
for successful listening, can result in a narrowing of the listening construct, as it leaves out important
linguistic devices that speakers use to convey meaning to listeners (Buck, 2001; Wagner, 2004). In order
to confidently measure students’ listening comprehension, L2 listening tests need to maintain a balance
between the different types of lower-order and higher-order skills.
This empirical study investigated the performance of 87 examinees studying English as a second
language at an Intensive English Program (IEP) in the United States. Based on their responses to 30
listening items used in an IEP placement test, examinees’ performance on items measuring lower- and
higher-order listening skills was analyzed across two proficiency groups. Specifically, the examinees’
scores for items measuring lower- and higher-order listening skills were compared for the two
proficiency groups and then correlated with their overall listening scores. The results indicated that,
while participants’ performance on both types of items (i.e., lower- and higher-order listening skills)
successfully distinguished between the two proficiency groups, the scores in both groups for items
measuring higher-order listening skills were more closely related (than scores for lower-order listening
skills) to the overall listening scores. The implications of these findings can help to raise awareness for
including both lower-order and higher-order skills in L2 listening tests, resulting in more informed
decisions that can lead to better assessment of L2 students’ listening comprehension.
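A minimal sketch of the comparison described above, assuming hypothetical long-format data with one row per examinee-item response:

import pandas as pd

df = pd.read_csv("listening_responses.csv")   # hypothetical file

# Sum each examinee's scores separately for lower- and higher-order items.
subscores = (
    df.pivot_table(index=["examinee", "group", "overall"],
                   columns="skill", values="score", aggfunc="sum")
      .reset_index()
)

# Correlate each subscore with the overall listening score, per group.
for g, sub in subscores.groupby("group"):
    print(g, sub[["lower", "higher"]].corrwith(sub["overall"]).round(2))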
Notes
Paper
Building Evidence for the Evaluation of English Learners’ Writing Scores
Anthony Becker
Second Language Testing, Inc.
Performance-based assessment (PBA) has become the primary means for assessing L2 learners’ writing
abilities. Despite its popularity in L2 writing, validation is a contentious issue in PBA, as scoring rubrics
are rarely entered as evidence for the appropriateness of decisions made from test scores (Cumming et
al., 2004; Leung & Lewkowicz, 2006). In an argument-based approach to validity, the investigation of
scoring rubrics is an important aspect of a major inference, evaluation. Evidence that supports the
evaluation inference is crucial in the validation process, since decisions will likely be less valid if
scoring rubrics are not adequately constructed and appropriately used (Kane, 2006). To provide
evidence for the evaluation inference, five assumptions should be supported: a) rubrics capture relevant
aspects of performance at different score levels; b) attempts were made to standardize scoring
procedures; c) raters know how to implement rubrics; d) experienced and novice raters score students’
writing similarly; and e) rubrics function appropriately with their existing scales.
This research study, conducted in 2009-2010, had a two-fold purpose: 1) to investigate the
quality of scoring rubrics used to assess L2 students’ writing ability at four Intensive English Programs
(IEPs) and 2) to determine the value of a proposed framework for examining the evaluation inference.
The study incorporated a multiple case-study methodology, whereby quantitative and qualitative
evidence was collected. The results indicated that rubrics captured relevant aspects of writing
performance and that the rating scales used in the rubrics functioned appropriately to distinguish
different levels of performance, despite the finding that teacher-raters could have benefitted from
additional scorer training. Also, the results of the investigation of the evaluation inference demonstrated
that IEP administrators perceived the framework as being coherent and useful, but that it could have
been more adequate and implementable.
Notes
Paper
Textual Borrowing and Rater Perceptions in Integrated Writing Tasks
Sara Cushing Weigle
Georgia State University
Megan Montee
Georgia State University
Integrated assessment tasks are intended to more closely reflect language use in real-world academic
settings than tasks that measure only one skill. In the case of integrated reading and writing tasks, test
takers base their writing on one or more input reading texts. While integrated tasks offer important
benefits in terms of task authenticity, their use also raises issues about how test takers incorporate the
ideas and language from source texts into their writing. Textual borrowing refers to the direct use of
language from the source text. While previous research (Cumming et al., 2005; Weigle & Parker, 2010)
has looked at patterns of textual borrowing in writing assessment, no research to date has examined how
test raters perceive source-based writing and how textual borrowing may affect their ratings. Textual
borrowing may lead to inaccurate ratings by masking students’ writing proficiency (Weigle, 2002). In
addition, issues of what raters perceive as appropriate and inappropriate source-based writing, and the
extent to which these perceptions reflect the expectations of real-world academic writing, are essential to
task authenticity. In the context of second language writing and writing assessment, source-based
writing also raises questions of cultural differences in how writers use sources and attribute source
material.
This paper presents the results of an exploratory study of how test raters identify, perceive and make
scoring decisions about textual borrowing. The context of the study is a locally developed writing exam
used for placement in English as a Second Language courses. Data collection included focus groups and
stimulated recalls with test raters. Raters also completed a judgment task about the acceptability of
instances of textual borrowing in essay exams. This paper presents results from the study focusing on
implications for rubric design and rater training.
Notes
Paper
The Development of Specifications for an Oral Proficiency Test:
Contributions to Validity Claims
Francesca Di Silvio
Center for Applied Linguistics
Anne Donovan
Center for Applied Linguistics
Beth A. Mackey
Visiting Scholar, Center for Applied Linguistics
Detailed test specifications contribute to a validity argument by documenting what the test purports to
measure (Bachman & Palmer, 1996; Davidson & Lynch, 2002; Hughes, 1991). Bearing in mind the
audience of potential test users and test takers in specification design ensures consideration of issues of
consequential validity during the development process. Documented test specifications also facilitate
development of additional test items and test forms that are consistent in characteristics. Furthermore, an
iterative process of specification development based on feedback from internal review, field tests, and
operational use provides a clear record of evidence for validity claims (Bachman & Palmer, 1996;
Fulcher, 2003).
This paper describes the development of specifications for a computer-delivered oral proficiency test for
language learners of high school age and above that is aligned to the ACTFL Proficiency Guidelines—
Speaking. Specifications were designed based on operational tests in two languages that have been
shown to elicit results comparable to a preceding tape-mediated test of oral proficiency (Kenyon &
Malabonga, 2001) that demonstrated high correlations with the ACTFL OPI (Stansfield, 1990). The
purpose of the project was to write and pilot specifications to facilitate efficacious development of tests
in additional languages, with a particular focus on the item writing process which has been infrequently
addressed in the literature (Kim et al., 2010).
The paper presents methodology used to develop specifications consistent with Popham’s test
specification model (1978), including procedures for task writing, review, and banking. Presenters will
emphasize how the iterative process of test specification development contributes to validity claims.
Presenters will also discuss the issue of specificity in view of lessons learned during task writing and
implications for operational development. The iterative methodology presented may be replicated by
other language testers to build a validity argument as well as a foundation to support efficient test
development.
Notes
Paper
Assessing Learning Outcomes in Short-term Foreign Language Programs:
Validation Results of a Triangulated Assessment System
Megan C. Masters
University of Maryland
Steven J. Ross
University of Maryland
Margaret E. Malone
Center for Applied Linguistics
There are currently no nationally recognized, standardized assessment tools available, particularly in critical
languages, to document the language learning gains for beginning-level language learners participating in
short-term programs. To assess the effectiveness of short-term foreign language programs, reliable and valid
measures of learning outcomes are needed, especially since commonly used assessment tools, such as the
Oral Proficiency Interview (OPI), are not granular enough to document learner progress after only a few
weeks of study.
The current paper describes the results of a national, two-year study (N=700) of three different assessment
tools piloted for use to inform the development of a triangulated assessment system. The purpose of the
assessment system was to examine the reliability, consistency across language programs and convergent
validity of three instruments developed as indicators of learning outcomes. Participants were students in
grades 9-12 who were enrolled in short-term Arabic, Chinese and Hindi language programs. The programs
were designed to initiate and sustain interest in language study and to foster proficiency in strategically
important languages. The three assessment tools--the first primarily objective and the remaining subjective--
included: (1) a computerized proficiency test of four language skills, (2) a student self-assessment and (3) a
teacher assessment of student performance. This paper will explore the results of the pilot assessment
system, including the correlations between the three assessments on measuring student proficiency as well as
the outcomes of Rasch rating scale/mixed-scale modeling, which allows for direct comparison across the
three assessment instruments. The implications of these results will be examined in light of their provision of
empirically grounded evidence of student performance and program quality. Results will include an analysis
of cross-program comparability of outcomes, the evaluation of the impact of training on the alignment of the
triangulated assessment system and the identification of criteria to be considered for diagnostic feedback to
program directors and students on their use of the assessment instruments.
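As a rough sketch of the convergent-validity piece of such a system, assuming a hypothetical per-student summary file and column names, rank correlations among the three instruments can be inspected per language program:

import pandas as pd

df = pd.read_csv("triangulated_scores.csv")   # hypothetical columns below
cols = ["proficiency_test", "self_assessment", "teacher_rating"]

# Convergent validity: rank correlations among the three instruments,
# computed within each language program for cross-program comparison.
for lang, sub in df.groupby("language"):
    print(lang)
    print(sub[cols].corr(method="spearman").round(2))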
Notes
Paper
Investigating the Construct Validity of the Grammar Sub-Test of the CEP
Placement Exam
Payman Vafaee
Teachers College, Columbia University
Nesrine Basheer
University of Maryland
Reese Heitner
Teachers College, Columbia University
An important assumption in language testing is that test items or observable variables correspond to the
structural relations hypothesized in the theoretical model or constructs governing the design of the
testing instrument (e.g., Shin, 2005). Accordingly, the purpose of the present construct validity study
was to investigate the extent to which scores from the grammar sub-test of the Columbia University
Community English Program (CEP) placement test could be interpreted as indicators of test takers’
grammatical knowledge. In the current study, we adopted Purpura’s (2004) model, which hypothesizes
that grammatical knowledge consists of two underlying factors of form and meaning. To this end, we
conducted a confirmatory factor analysis to investigate if the data from this test could fit this theoretical
model. In addition, since the test items were not discrete-point but were nested within one of four tasks
(each with its own theme), the interaction effects of these four themes on individual items were
also investigated. The data for this study was collected from the administration of the test to 144
participants. In preparing the data for CFA, descriptive statistics were examined, and reliability
analysis and exploratory factor analysis were conducted. Upon examining several CFA models, a
multi-trait multi-method model achieved the best possible model-fit in accordance with substantive
considerations and issues of parsimony. In conclusion, this full-latent model, which included two trait
factors of grammatical form and meaning and four method factors, confirmed that the CEP test
examined the grammatical knowledge proposed in the hypothesized theoretical model. These findings
contribute to recent discussion concerning the importance of both construct definitions and method
effect in testing L2 grammatical knowledge.
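One way to specify a multi-trait multi-method CFA of this general shape in Python is with the semopy package (lavaan in R is a more common choice); the item-to-factor assignments below are invented for illustration and do not reflect the actual CEP test structure.

import pandas as pd
from semopy import Model, calc_stats

# Hypothetical MTMM specification: each item loads on one trait factor
# (form or meaning) and on the method factor for its task theme.
desc = """
form    =~ i1 + i3 + i5 + i7
meaning =~ i2 + i4 + i6 + i8
theme1  =~ i1 + i2
theme2  =~ i3 + i4
theme3  =~ i5 + i6
theme4  =~ i7 + i8
form ~~ meaning
"""

data = pd.read_csv("cep_grammar_items.csv")   # hypothetical item scores
model = Model(desc)
model.fit(data)
print(calc_stats(model))   # CFI, RMSEA, etc., for judging model fit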
Notes
Poster abstracts listed in alphabetical order
Poster
Investigating Assessment Literacy through Teacher Professional Development
Aileen Bach
Center for Applied Linguistics
Anne Donovan
Center for Applied Linguistics
This poster will present findings of a study of assessment literacy of teachers of less commonly taught
languages; it describes the changes in instructor knowledge during a blended learning assessment course
that combines online and face-to-face formats. Assessment literacy is defined as the knowledge and
skills teachers need to accurately and effectively plan for and administer assessments and interpret and
apply the results (Boyles 2005; Taylor, 2009). Effective classroom assessment is an important
component of today’s foreign language classroom, because it provides insight for teachers on how to
improve teaching and learning in their classroom (Shepard, 2000). The nine-week course described in
this poster was designed for in-service teachers of less commonly taught languages. These teachers
usually work with limited teaching resources and often have limited training in pedagogy (Wan, 2009),
thus creating a great need for assessment literacy in this population.
This study provides insight into how to cultivate assessment literacy by tracking the progress and depth
of understanding of course participants over time by analyzing data from various sources. Data were
collected through pre- and post-course surveys, assessment tasks developed during the course,
discussion board posts, and observations. These data also show what LCTL teachers know about
assessment, how their knowledge emerges through assessment practices, and what opportunities exist
for providing teacher professional development in order to strengthen their use of assessments. Since
this study explores the evolution of LCTL teachers’ assessment knowledge, beliefs, and practices, it
provides an important and often-overlooked foundation for the development of assessment literacy
training.
References
Boyles, P. (2005). Assessment literacy. In M. Rosenbusch (Ed.), National Assessment Summit Papers
(pp. 11-15). Ames, IA: Iowa State University.
Shepard, L. (2000). The role of assessment in a learning culture. Educational Researcher, 29(7), 4-14.
Taylor, L. (2009). Developing assessment literacy. Annual Review of Applied Linguistics, 29, 21–36.
Wan, S. (2009). Preparing and supporting teachers of Less Commonly Taught Languages. Modern
Language Journal, 93(2), 282-287.
Notes
Poster
The Design of an Oral Test for Foreign Language Teachers:
Validating the Vocabulary Descriptors
Melissa Alves Baffi-Bonvino
Universidade Estadual Paulista, Brazil
This poster will report on the results of a six-year research study about the process of assessing
vocabulary oral production in English as a Foreign Language (EFL) of undergraduate students in a pre-
service teacher education course at a public university in Brazil, who were preparing to enter the field of
English language teaching (ELT). It aimed to analyze students’ oral proficiency considering vocabulary
use in order to contribute to the validation of the vocabulary descriptors. The research is interpretive
and data analysis was developed by means of qualitative and quantitative procedures. Lexical
proficiency was assessed by means of an oral test designed as a pilot test (TEPOLI) and mock speaking
tests of the high stakes exams FCE and IELTS, whose approach to speaking is grounded in Bachman’s
(1990), Canale and Swain’s (1980) and Canale’s (1983) communicative competence models (Galaczi &
Khalifa, 2009). Individual interviews and questionnaires were also used to collect data, so that the
participants' perceptions could be considered within a broader view of the context. The analysis of
language data produced during the tests was carried out with the RANGE program (Victoria University
of Wellington) and with statistical analysis. The overall results indicate similarities in the levels of
proficiency obtained when data were compared and show that the connections between vocabulary and
oral proficiency can contribute to validate the vocabulary descriptors of TEPOLI by means of more
objective criteria to assess lexical oral proficiency. The study was carried out within a larger research
project about the process of validating a language proficiency examination for foreign language
teachers, grounded in the need to intervene in the Brazilian context of language teacher education to
establish clear criteria to assess language proficiency, in the language domain required for those
teachers.
Notes
Poster
Withdrawing from the Bank: Item Stability across Multiple Testing Contexts
Martyn Clark
Center for Applied Second Language Studies
One of the advantages of a Rasch-calibrated item bank is that it allows the creation of a wide variety of
tailored tests geared towards particular objectives yet still related to each other (Wright & Stone, 1999).
The Chinese Computerized Assessment of Proficiency (CAP) is a web-based, low-stakes test of Chinese
proficiency developed to provide useful feedback to teachers and language programs. Items for the
Chinese CAP were developed and calibrated using Rasch analysis of pilot test data (N=600) and expert
review. Two new versions of the test were created for different contexts using a subset of these
calibrated items.
This presentation will briefly discuss the selection and calibration of items in the CAP item bank, and
then present a comparison with the results of two alternate "forms" of the test drawn from the same
calibrated bank. In the first instance, a shortened version of the test was created for beginning students
studying Chinese in summer StarTalk programs (N=249) as part of a triangulation study between student
self-assessment, teacher retrospective assessment of student ability, and CAP scores. For this version of
CAP, items were chosen that targeted lower proficiency levels and matched the self-assessment "Can
Do" statements common to the other instruments in the study. In the second instance, a subset of items
geared towards intermediate-level proficiency was chosen to provide a pre/post measure of language
ability for a small group of students (N=19) participating in a summer study abroad program in China.
Item difficulty stability across these testing contexts is investigated through comparison of independent
Rasch item calibrations from each context with the original item bank values. Given that available
options for testing less commonly taught languages are relatively scarce, this presentation
should be of interest to others facing similar situations.
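The stability check itself can be sketched simply, assuming item difficulties (in logits) have already been estimated independently in each context; the values below are invented.

import numpy as np

bank = np.array([-1.2, -0.4, 0.1, 0.6, 1.3, 1.9])   # original bank values
new = np.array([-1.0, -0.5, 0.3, 0.4, 1.5, 2.1])    # independent recalibration

# Place the new calibration on the bank scale with a mean-mean link,
# then inspect per-item displacement (e.g., flag items beyond ~0.5 logits).
linked = new - (new.mean() - bank.mean())
displacement = linked - bank
print("correlation:", round(float(np.corrcoef(bank, linked)[0, 1]), 3))
print("displacement:", displacement.round(2))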
Notes
Poster
Bridging the Gap: How Language Testers can Build Assessment Literacy in
Practitioners of Other Fields
Anne Donovan
Center for Applied Linguistics
Margaret E. Malone
Center for Applied Linguistics
Francesca Di Silvio
Center for Applied Linguistics
Megan Montee
Georgia State University
Research on language assessment literacy, or the knowledge users need to have about language
assessment to make informed decisions (Inbar-Lourie, 2008; Taylor, 2009), has focused primarily on
instructional contexts and the needs of language teachers to conduct fair and valid assessments of their
students. However, language assessment literacy is important across a variety of contexts beyond
traditional language testing courses conducted for preservice teachers. This poster presents the initial
results of a multi-year project that examines the assessment literacy needs of two groups: Second
Language Acquisition (SLA) researchers and professionals who work in Language Teacher Education
(LTE). The project investigates the current assessment knowledge base of both groups to identify what
resources should be developed to meet the needs of these audiences. For the SLA context, the project
focuses on how collaboration can be facilitated between the fields of SLA and language testing in order
to support SLA researchers’ use of language tests in research contexts. For the LTE context, the project
focuses on the essential knowledge of assessment needed to provide appropriate pre- and in-service
professional development for language instructors. The project will develop assessment literacy
resources for each group.
This poster presents results from four focus groups conducted with SLA and LTE researchers and
practitioners. Focus groups were recorded, transcribed, and analyzed for major themes related to current
practices, challenges, and resources needed in language testing. The results of the focus groups will
inform the development of future phases of the project, including surveys of both groups’ assessment
needs. In addition to presenting the focus group results, the poster will discuss implications for
collaboration between language testers and other language professionals, and will raise questions about
the responsibility of the language testing community to make testing research accessible to colleagues in
related fields.
Notes
Poster
Transcription as a Language Testing Tool
Beth A. Mackey
U.S. Department of Defense
Foreign language specialists in the government find themselves constantly tested both formally and
informally, in the classroom and for professional certification. Informal language tests are created by
teachers for the pedagogical purposes of providing feedback to teacher and student, reaffirming and
measuring what learning has occurred and motivating students to retain and expand their language skills.
Federal agencies develop formal tests to assess foreign language proficiency (primarily in speaking as
well as in reading and listening comprehension) as a basis for making employment and career
development decisions. Government agencies also use task-based testing formats such as translation and
transcription (both verbatim and listening summaries) to measure the specific application of language
skills in a job-related context. While the use of translation as a measure of language proficiency has been
addressed in the literature (e.g., Buck, 1992), research on verbatim transcription in the foreign language
testing arena has been absent. This study (n=197) draws upon data collected by the Spanish Department
of a government language school. The dataset includes subscores in translation, transcription, and cloze.
Standardized tests of reading and listening proficiency are also available, allowing for a more thorough
exploration of transcription, translation, and their relationship with listening and reading skills. This
poster will explore transcription as a measure of listening comprehension, including correlation and
regression tables from the Spanish test results and qualitative input from an earlier survey of
transcription tests. The lack of attention in the testing literature suggests that this topic deserves further
investigation.
Notes
Poster
Task Complexity Features and Speaking Test Performance
Megan Montee
Georgia State University
Task complexity has become increasingly important in both Second Language Acquisition (SLA) and
language testing research, motivated in large part by the importance of language tasks for classroom
teaching and learning (Kim, 2009). However, the applications of SLA task-based research for language
testing are still unclear (Robinson, 2011) and there is a need for additional analysis of the relationship
between task features and language output in performance assessment tasks.
This poster presents the results of an exploratory study of task features and linguistic performance,
operationalized in terms of complexity, accuracy and fluency (CAF) measures. Data from the study
comes from a small corpus of transcribed responses to tasks from the WIDA ACCESS for ELLs
Speaking Test, an English language proficiency test used in U.S. public schools. The eight picture-based
tasks included in this study are intended to assess academic language use. However, there has been no
published research to date that systematically examines the relationship between the task features and
linguistic performance across various proficiency levels on the test. To explore this issue, the data
analysis for this study included coding each task for complexity features according to Robinson’s (2007)
task complexity framework. Student responses were then coded for multiple CAF measures, and these
results were compared with the task features as well as the scoring specifications for each task.
Results of this exploratory study have several implications. First, test developers may find these results
suggestive of the ways task features can be altered to elicit variations in examinee performance. Next,
the results indicate future directions for task-based research. Based on the results of the exploratory
analysis, this poster will present a proposed program of research to further explore task characteristics
and language production on the ACCESS Speaking Test.
Notes
Poster
Best Practices in Pilot Testing a High-stakes English Language Proficiency Test
Abbe Spokane
Center for Applied Linguistics
Tiffany Yanosky
Center for Applied Linguistics
Pilot testing—pre-operational research that is often small-scale and qualitative, and is intended to
identify necessary revisions in items—is a widespread practice but has little documentation in the literature.
This poster aims to document and disseminate pilot testing methods for a high-stakes, large-scale
assessment of English language proficiency using the framework of Bachman’s (2005) Assessment Use
Argument (AUA). The poster will address several research questions: Why is pilot testing important for
language test development? What pilot testing methods do the presenters use? What defines these
methods as best practices? How do theoretical foundations such as Bachman’s AUA structure and
support pilot testing practices? The AUA, as applied to language testing, "is an overall logical
framework for linking assessment performance to use (decisions)" (Bachman, 2005, p. 1). In pre-
operational testing, test developers claim that performance on new test items is based on the ability of
test takers rather than a construct-irrelevant variable and that the items should become operational. Data
collected during pilot testing consists of test takers’ responses to and feedback about items and serves as
backing to the original claim or to various rebuttals of that claim. The quality of the backing for the
claim and rebuttals leads to decisions about inclusion of the new test items on the operational test and
how the scores on those items should be interpreted and used to take actions related to test takers, such
as placement in programs or classes (Kenyon & MacGregor, in press). Adopting common terminology
for pre-operational practices, supporting pilot testing methods with theoretical models, and sharing best
practices would serve to enhance test validity arguments. We believe sharing our procedures will
encourage other test developers to do the same and will increase the use and quality of pilot testing and
produce more ethical and valid language tests.
Notes
Poster
Implementing Cognitive Diagnostic Assessment in an Institutional Test through
Collaboration of Language Testers: A New Networking Model in Language Testing
Yeon-Sook Yi
University of Illinois at Urbana-Champaign
Stephanie Gaillard
University of Illinois at Urbana-Champaign
Cognitive diagnostic assessment (CDA) has gained attention in language testing since the late 1990s,
yielding encouraging results in general. However, empirical studies to date have used large-scale,
standardized tests. We attempt to expand this previous context of CDA by applying it to a college French
placement test, which will be the first application of CDA to institutional-level language test development.
In order to do so, we report on a unique collaboration of researchers. A graduate student specializing in
measurement/language testing brings theoretical and empirical knowledge of CDA to the project, while
another student majoring in French/language testing provides linguistic expertise from the initial phase
of identifying attributes through the iterative process of refining the Q-matrix and final score reporting.
The ultimate results of this collaboration will be used to assign students to appropriate course levels,
and the fine-grained feedback about students’ performance will be utilized in classroom teaching. More
French teachers and second language learners are also involved in the project: graduate
teaching assistants of French will participate in identifying language attributes and constructing a Q-
matrix and learners of French will contribute to specifying attributes of the test items through verbal
protocol reports.
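For readers unfamiliar with the Q-matrix that the collaboration centers on, it is simply a binary item-by-attribute matrix; the attributes and loadings below are invented for illustration, not the project's actual specification.

import pandas as pd

# Hypothetical Q-matrix: rows are test items, columns are skill attributes;
# a 1 means the item requires that attribute. A CDA model (e.g., DINA) uses
# this matrix to turn item responses into per-attribute mastery feedback.
q_matrix = pd.DataFrame(
    {
        "verb_morphology":    [1, 0, 1, 0, 0],
        "core_vocabulary":    [0, 1, 1, 0, 1],
        "listening_decoding": [0, 0, 0, 1, 1],
    },
    index=[f"item_{i}" for i in range(1, 6)],
)
print(q_matrix)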
We foresee that this innovative collaboration model can extend to other foreign language testing at our
university, and we hope that positive results from this research will establish important empirical
evidence that different types of language tests in different settings can also benefit from the strengths of
this new testing approach. We also expect this collaboration model to help make CDA, a method often
deemed a technically challenging multi-step procedure used only by a limited number of interested
groups, more accessible to a wider public of language teachers and testers.
Notes
What is the East Coast Organization of Language Testers?
The East Coast Organization of Language Testers (ECOLT) represents an East Coast
group of professionals, scholars, and students who are involved in language testing
projects and research. One of the organization’s goals is to support connections between
academia, government, and testing organizations. In addition to providing a forum for
continued learning and networking, ECOLT strongly supports the work of graduate
students.
For more information about ECOLT, contact:
Dr. Margaret E. (Meg) Malone, Center for Applied Linguistics at: [email protected]
Program printed courtesy of Second Language Testing, Inc.