East Coast Organization of Language Testers
Conference 2011
Friday, October 28th
Saturday, October 29th
Georgetown University
Washington, DC, USA
Validity:
Opportunities & Challenges
National Capital Language Resource Center
Center for Applied Linguistics
Second Language Testing, Inc.
Georgetown University
ECOLT Conference Committee
Tim Farnsworth, Program Chair, Hunter College, the City University of New York
Margaret E. (Meg) Malone, Conference Co-chair, Center for Applied Linguistics
Francesca Di Silvio, Conference Co-chair, Center for Applied Linguistics
ECOLT Abstract Reviewers
Rachel Brooks, Federal Bureau of Investigation
Jee Wha Dakin, Oxford University Press
Talia Isaacs, University of Bristol
Lorena Llosa, New York University
Margaret Malone, Center for Applied Linguistics
Maria McCormack, Teachers College, Columbia University
Christine Rosalia, CUNY Hunter College
Sun-Young Shin, Indiana University at Bloomington
Elvis Wagner, Temple University
The Conference Committee would like to thank the following
individuals and organizations for their support of ECOLT 2011:
Dorry Kenyon, Anne Donovan, and Aileen Bach, Center for Applied Linguistics
Mackenzie Price and Colleen Moorman, Georgetown University
Scott McGinnis, Defense Language Institute
Charles Stansfield, Second Language Testing, Inc.
National Capital Language Resource Center
Second Language Testing, Inc.
The Testing Committee of the Interagency Language Roundtable
Special thanks to Jeff Connor-Linton, Georgetown University, and Anup Mahajan,
National Capital Language Resource Center
Agenda
Friday, October 28, 2011
All paper presentations will be given in the Intercultural Center (ICC) Auditorium, 2nd floor. Please note
that the main entrance to the ICC is on the 3rd floor.
Refreshments will be located in the foyer of the auditorium.
Poster presentations will be located in ICC room 107, 1st floor.
8:30-9:30 Registration
9:30-9:45 Introductions
9:45-10:45 ILR Panel
10:45-11:00 Break
11:00-12:00 Plenary Presentation
Validity Arguments for Language Assessment: Opportunities and Challenges
Carol A. Chapelle
12:00-12:30 Paper Session 1 (1 paper)
The Assessment of Reading Comprehension Skills for Immigrants: The Case of the
Netherlands
Ryan Downey, Jo Fond Lam, Alistair Van Moere
12:30-1:30 Lunch
1:30-3:00 Paper Session 2 (3 papers)
Relationships between Linguistic Complexity Measures and Student Scores on a 6th
Grade Science Assessment
Aubrey Logan-Terry, Timothy Farnsworth
Validating an Assessment Framework of Linguistic Knowledge for Teaching Math and
Science to English Language Learners
Sultan Turkan, Jerry Bicknell
Assessing ELL Content in the Mainstream Classroom: Teacher Decision-making
Processes
Beth Clark-Gareca
3:00-3:15 Break
3:15-4:15 Paper Session 3 (2 papers)
Differential Item Functioning in a High Stakes Test
Mohammad Salehi, Alireza Tayebi
Evaluating Oral Collocational Production to Predict L2 Oral Proficiency
Jing Xu
4:15-4:30 Break/Poster Set Up
4:30-6:00 Poster Session (8 posters)
Investigating Assessment Literacy through Teacher Professional Development
Aileen Bach, Anne Donovan
The Design of an Oral Test for Foreign Language Teachers: Validating the Vocabulary
Descriptors
Melissa Alves Baffi-Bonvino
Withdrawing from the Bank: Item Stability across Multiple Testing Contexts
Martyn Clark
Bridging the Gap: How Language Testers Can Build Assessment Literacy in
Practitioners of Other Fields
Anne Donovan, Margaret E. Malone, Francesca Di Silvio, Megan Montee
Transcription as a Language Testing Tool
Beth Mackey
Task Complexity Features and Speaking Test Performance
Megan Montee
Best Practices in Pilot Testing a High-stakes English Language Proficiency Test
Abbe Spokane, Tiffany Yanosky
Implementing Cognitive Diagnostic Assessment in an Institutional Test through
Collaboration of Language Testers: A New Networking Model in Language Testing
Yeon-Sook Yi, Stephanie Gaillard
Saturday, October 29, 2011
8:30-9:30 Registration
9:30-11:00 Paper Session 1 (3 papers)
Assessing Lower-order and Higher-order Listening Skills for ESL Students
Tatiana Nekrasova-Becker, Anthony Becker
Building Evidence for the Evaluation of English Learners’ Writing Scores
Anthony Becker
Textual Borrowing and Rater Perceptions in Integrated Writing Tasks
Sara Cushing Weigle, Megan Montee
11:00-11:15 Break
11:15-12:45 Paper Session 2 (3 papers)
The Development of Specifications for an Oral Proficiency Test: Contributions to
Validity Claims
Francesca Di Silvio, Anne Donovan, Beth Mackey
Assessing Learning Outcomes in Short-term Foreign Language Programs: Validation
Results of a Triangulated Assessment System
Megan C. Masters, Steven J. Ross, Margaret E. Malone
Investigating the Construct Validity of the Grammar Sub-Test of the CEP Placement
Exam
Payman Vafaee, Nesrine Basheer, Reese Heitner
Paper abstracts listed in presentation order
ILR Panel
Standard Setting in the Department of Defense (DoD)
The ILR Testing Committee
The ILR Testing Committee will host a panel discussion on standard setting. Standard setting is a well-
established method in educational testing for having experts determine what kind of performance on a
test demonstrates a given level of competence. Determining how to use standard-setting
recommendations is a key issue in building a validity argument.
Representatives from several agencies will discuss their roles in the standard setting process, moving
through the chronology of a standard setting workshop. The process as implemented by the Defense
Language Institute Foreign Language Center (DLIFLC) begins with the workshop participants, who
represent a large cross-section of stakeholders. Participants are trained in general Borderline Proficiency
Level Descriptions (BPLDs), which are statements derived from the ILR Skill Level Descriptions to
capture the minimum performance expectations for each level. Standard-setting participants
independently make a judgment for each item on the standard-setting test form about whether examinees
at each borderline would or would not know the answer to that item. Statisticians at the DLIFLC review
the results upon completion of the panel; information is compiled and cut scores are transformed onto
the Item Response Theory theta scale.
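For illustration only, the following Python sketch shows the arithmetic behind a yes/no Angoff-style aggregation like the one described above: panel judgments are averaged into a raw cut score, which is then mapped onto the theta scale by inverting a test characteristic curve. All data, parameter values, and the 2PL model choice are hypothetical assumptions, not details of the DLIFLC procedure.

import numpy as np

# Hypothetical yes/no judgments: judgments[p, i] = 1 if panelist p believes a
# borderline examinee at a given level would answer item i correctly.
rng = np.random.default_rng(0)
judgments = rng.integers(0, 2, size=(12, 60))   # 12 panelists, 60 items

# Each panelist's implied raw cut is the number of items a borderline
# examinee is expected to answer correctly; the panel cut is the mean.
raw_cut = judgments.sum(axis=1).mean()

def tcc(theta, a, b):
    """Expected raw score as a function of theta under a 2PL model."""
    return (1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))).sum(axis=1)

a = rng.uniform(0.5, 2.0, 60)   # hypothetical discrimination parameters
b = rng.normal(0.0, 1.0, 60)    # hypothetical difficulty parameters
grid = np.linspace(-4, 4, 801)
theta_cut = grid[np.argmin(np.abs(tcc(grid, a, b) - raw_cut))]
print(f"raw cut = {raw_cut:.1f}; theta cut = {theta_cut:.2f}")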
Panelists will include stakeholders, standard-setting participants, standard-setting planners/organizers,
Defense Language Proficiency Testing program management, and statisticians. The goal of the ECOLT
discussion is to provide a forum for discussion of a real-world standard setting activity, drawing upon
various perspectives. Because of the high-stakes nature of the Defense Language Proficiency Testing
System, constant improvement to testing is a critical component in ensuring test validity. We expect that
this topic will appeal to the ECOLT audience as it represents the intersection of theory and practice.
Notes
Plenary
Validity Arguments for Language Assessment: Opportunities and Challenges
Carol A. Chapelle
Iowa State University
Approaches to validity arguments making their way into language assessment offer some important
advances. Kane (2006) and others working in this area provide basic conceptual tools which are aimed,
in part, at challenging test developers and users to make explicit the intended interpretations and uses of
test results in addition to the evidence and rationales that support such interpretations and uses. The
overall goal of reaching a well-justified conclusion (about test interpretation and use) is similar to past
practices in test validation; what is different is the specific conceptual infrastructure offered for doing so
(Chapelle, Enright, & Jamieson, 2010).
I will suggest that such tools for developing validity arguments are important for language testing
because of their role in prompting clear and detailed interpretive arguments, which point to research
needed to support the argument’s claims. At the same time, the opportunity to include such delicate
levels of specification, evidence and rationales raises challenges which may help to push applied
linguistics farther.
I will provide examples of some of the opportunity/challenge points that language testers face as they
use the tools of current validity argument approaches. One is the potential need that arises for an
inference in the argument to warrant the sampling of specific linguistic features in test development. A
second is the need for substantial evidence supporting scoring rules on constructed response tests. A
third is the need to converge interpretive arguments from multiple assessment procedures to warrant a
decision. These issues are not necessarily unique to language assessment, but here they become apparent
because of the detail with which language ability can be defined and the need to specify the relevant
domain of language in most interpretive arguments.
References
Chapelle, C. A., Enright, M. E., & Jamieson, J. (2010). Does an argument-based approach to
validity make a difference? Educational Measurement: Issues and Practice, 29(1), 3-13.
Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational Measurement (4th ed., pp. 17-64).
Westport, CT: Greenwood Publishing.
Notes
Paper
The Assessment of Reading Comprehension Skills for Immigrants:
The Case of the Netherlands
Ryan Downey
Pearson Knowledge Technologies
Jo Fond Lam
CINOP
Alistair Van Moere
Pearson Knowledge Technologies
The use of language assessments for immigration screening has wide-ranging implications for validity
and score use. This paper analyses the case of the Netherlands, where candidates for immigration must
currently pass three tests which are administered in embassies worldwide: an oral test of knowledge
about Dutch culture, a test of oral proficiency in Dutch language, and a newly-implemented test of basic
literacy and reading comprehension in Dutch. All three tests are automated and delivered by telephone;
the candidate hears recorded test questions or prompts over the telephone, and the candidate’s spoken
responses are scored using speech processing technology.
The new test, the "Toets Geletterdheid en Begrijpend Lezen" ("Test of Literacy and Reading
Comprehension"), was developed under the assumption that social integration will be more successful if
immigrants have basic (CEFR A1 level) reading comprehension skills in Dutch. In some sections of the
test the candidate is instructed to read aloud from a test paper and in other sections the candidate must
read silently and then give spoken answers to written comprehension questions.
There were many challenges to constructing a validity argument for the Dutch reading test. This
presentation discusses the development of the test in accordance with the policy requirements of the
government of the Netherlands, and then describes several studies aimed at providing valid
interpretations of test scores. Important questions pertaining to validity are raised and answered,
including: What are the challenges to the validity of a reading test which must elicit oral responses to
comprehension questions, and how are they overcome? Why was the construct of "literacy" defined by
the government and how is it assessed? Can the machine scoring system disassociate construct-
irrelevant abilities, such as pronunciation? And finally, where is the line drawn between ethics of
language test construction and policy implementation?
Notes
Paper
Relationships between Linguistic Complexity Measures and Student Scores
on a 6th Grade Science Assessment
Aubrey Logan-Terry
Georgetown University
Timothy Farnsworth
Hunter College, City University of New York
The validity of English language content tests for English language learners (ELLs) has recently been an
area of intense research interest, because researchers have argued that language serves as a source of
construct-irrelevant variance in these tests (e.g. Abedi, 2006; Abedi, Courtney, & Leon, 2001; Abedi &
Lord, 2001; Farnsworth, 2006; Martiniello, 2009; Staley, 2005; Wolf & Leon, 2009); however, other
recent research suggests that linguistic aspects of test items may be construct-relevant (Bailey, 2005;
Farnsworth, 2008) and differences in performance across groups may be due to factors other than
language barriers in the tests (Bachman & Koenig, 2004; Elliot, 2008; Ockey, 2007). The present study
employs a computerized readability tool (Coh-Metrix) and Multilevel Modeling (MLM) to investigate
whether linguistic complexity of test items is predictive of students' item-level scores, and whether the
relationships vary between ELLs and non-ELLs, thereby shedding light on possible language barriers to
performance in the tests. Preliminary findings from analysis of a corpus of 6th grade classroom science
tests (n = 852 students) indicate that linguistic complexity of test items is an important predictor of all
students' scores, but not always in the expected direction. Results also provide some limited evidence
that constructed response formatting and increasing syntactic complexity of items disadvantage ELLs.
The study has additional significance for the field as it highlights the promise of extending a similar
Coh-Metrix/MLM methodology to analysis of large-scale standardized content tests for ELLs.
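As a minimal sketch of the kind of item-level multilevel model described above, the following Python code fits a logistic model with crossed random intercepts for students and items; the file name, column names, choice of Coh-Metrix indices, and the use of statsmodels' variational Bayes mixed GLM (rather than the authors' actual MLM software) are all assumptions.

import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Hypothetical long-format data: one row per student-item response, with a
# binary score, ELL status, and item-level Coh-Metrix complexity indices.
df = pd.read_csv("science_item_responses.csv")

model = BinomialBayesMixedGLM.from_formula(
    "correct ~ syntactic_complexity * ell + word_frequency * ell",
    vc_formulas={"student": "0 + C(student_id)", "item": "0 + C(item_id)"},
    data=df,
)
result = model.fit_vb()   # variational Bayes estimation
print(result.summary())  # ELL interactions flag possible language barriers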
Notes
Paper
Validating an Assessment Framework of Linguistic Knowledge for Teaching
Math and Science to English Language Learners
Sultan Turkan
Educational Testing Service
Jerry Bicknell
Educational Testing Service
One out of every five content teachers faces the challenge of teaching content to English Language
Learners (ELLs) in comprehensible and accessible ways. Their challenge is closely associated with the
degree to which they are prepared to scaffold ELLs' academic achievement in the U.S. Preparing teachers
to face the challenge could be systematized by defining the essential knowledge base that teachers of
ELLs should be held accountable for when entering the profession. This knowledge base needs
solidifying through empirical and evidence-based validity arguments.
The purpose of this study was to validate an assessment framework of linguistic knowledge teachers
need to teach math and science to English-language Learners (ELLs). To this end, the researchers
conducted a literature review, a national survey of teachers and teacher educators from districts with
high ELL populations, and convened a panel of expert teachers, researchers, and teacher educators. The
survey items were rated by a national sample of 358 teachers and teacher educators on a 5-point scale
ranging from 1 (not at all important) to 5 (extremely important). As a result of these efforts, two primary
domains of knowledge emerged in the framework: knowledge of academic language and knowledge of
how to make content accessible to ELLs. Presenters unpack each domain, exemplifying items that were
administered as part of a pilot test to middle school math and science teachers working in high-ELL
school districts. Presenters discuss how a small sample of teacher test takers received the test items as
part of the cognitive interviews. Participants leave the session with an understanding of the challenges
and promises of designing assessment items of linguistic knowledge required for teaching math and
science to ELLs. This research is significant as it sits at the intersection of how academic language is
defined and taught for content instruction in the U.S. context.
Notes
Paper
Assessing ELL Content in the Mainstream Classroom:
Teacher Decision-Making Processes
Beth Clark-Gareca
New York University
Steinhardt School of Culture, Education and Human Development
Content tests for English Language Learners (ELLs) are an increasingly valuable component in
students’ scholastic portfolio and provide critical information upon which academic tracking, promotion,
and remediation decisions are made. Though ELL content assessment has been investigated primarily
through test accommodations on high-stakes tests, little is known to date about how ELLs take
content tests in mainstream classrooms; this despite many state departments of education (e.g.,
Pennsylvania, Texas, North Carolina, and Florida) relying on classroom accommodations practices to
set a precedent for standardized testing protocols. This descriptive study contributes to the nascent body
of knowledge relating to teacher practices when evaluating ELLs’ academic performance in content
areas. Data collection consisted of three classroom observations in ten 4th grade classrooms; initial
observations were conducted of routine math and/or science instruction followed by two observations of
math and/or science tests. Observations were followed by teacher interviews designed to explore their
decision-making processes during classroom assessment as well as when evaluating ELL student work.
Inductive coding was used to identify themes in teacher responses relating to decision-making in
assessment, accommodations, and grading practices. Findings suggested that teachers have great
autonomy in their assessment practices, with accommodations implementation varying widely
depending on teacher evaluation of student needs. Teacher perception of student language proficiency
proved a consistent criterion upon which decisions were based, though certain high-stakes
accommodations such as providing bilingual/translated tests or dictionary use were not observed during
any content test administrations. Accommodated grading practices were common, primarily through
systems which weighted student participation and effort above academic achievement. Of note was that
in a majority of classrooms, accommodations implementation did not differ between ELLs and students
with special needs; a fact which calls into question overall teacher understanding of language acquisition
processes.
Notes
Paper
Differential Item Functioning in a High Stakes Test
Mohammad Salehi
Sharif University of Technology, Tehran, Iran
Alireza Tayebi
Sharif University of Technology, Tehran, Iran
Validation is an important enterprise, especially when the test to be validated is a high-stakes one.
Messick’s notion of construct irrelevant factors is pertinent in test validation. Demographic variables
like gender, field of study and age can affect test results and interpretations. A fair test needs to be
neutral when it comes to construct irrelevant factors such as gender. Differential item functioning is a
way of making sure that the test does not favor one group of test takers over the others. The current
study investigated differential item functioning (DIF) in terms of gender in the reading comprehension
subtest of a high-stakes test using a three-step logistic regression procedure (Zumbo, 1999). The test
comprises three sections, grammar, vocabulary, and reading comprehension, totaling 100 items, of which
the last 35 (reading comprehension) were investigated. The participants of the
study were 3,398 test takers, both males and females, who took the test in question (the UTEPT) as a
partial requirement for entering a PhD program at the University of Tehran. In order to show whether
the 35 items of reading comprehension exhibited DIF or not, logistic regression using a three-step
procedure (Zumbo, 1999) was employed. Three sets of criteria were selected, namely Cohen's (1988),
Zumbo's (1999), and Jodoin and Gierl's (2001). It was revealed that, though the 35 items of the reading
section show "small" effect sizes according to Cohen's classification, they do not display DIF based on
the other two criteria. Therefore, it can be concluded that the reading comprehension subtest of the
UTEPT favors neither males nor females.
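A schematic version of the three-step logistic regression procedure is sketched below, assuming hypothetical file and column names; note that Zumbo's (1999) own effect-size measure is a weighted least squares R-squared, for which McFadden's pseudo-R-squared is substituted here, and the 0.035 flag is the Jodoin and Gierl (2001) threshold.

import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def dif_logistic(df, item, total="total_score", group="gender"):
    """Compare the conditioning-only model with the model adding group and
    group-by-total terms; the 2-df test covers uniform and nonuniform DIF."""
    m1 = smf.logit(f"{item} ~ {total}", data=df).fit(disp=0)
    m3 = smf.logit(f"{item} ~ {total} * {group}", data=df).fit(disp=0)
    lr = 2 * (m3.llf - m1.llf)
    p = stats.chi2.sf(lr, df=2)
    delta_r2 = m3.prsquared - m1.prsquared   # McFadden stand-in effect size
    return lr, p, delta_r2

df = pd.read_csv("utept_responses.csv")      # hypothetical wide-format data
for i in range(66, 101):                     # the last 35 (reading) items
    lr, p, dr2 = dif_logistic(df, f"item_{i}")
    if p < 0.05 and dr2 >= 0.035:
        print(f"item_{i}: possible DIF (LR={lr:.1f}, dR2={dr2:.3f})")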
Notes
Paper
Evaluating Oral Collocational Production to Predict L2 Oral Proficiency
Jing Xu
Iowa State University
State-of-the-art automated speech evaluation (ASE) systems such as SpeechRater and Versant use a
subset of the criteria evaluated by human raters to predict human scores (Weigle, 2010; Xi, 2010).
However, for more accurate score prediction, the construct coverage of these systems needs to be
expanded to include additional speech features that are construct-relevant and measurable (Xi et al.,
2008). Collocations play a scaffolding role in building up oral language but these formulaic expressions
have been found to pose serious problems for second language (L2) learners (e.g., Ellis, 2008). Many
researchers have argued that L2 learners’ collocation use in spontaneous speech is a good indicator of
their oral proficiency (e.g., Handl, 2008). Hence, oral collocation production, if measured appropriately,
might be useful in improving ASE systems’ predicting power. The present study explores several ways
of evaluating collocations produced by L2 speakers and examines the extent to which appropriate
collocation use contributes to oral proficiency. Twenty Chinese learners of English representative of
four oral proficiency levels were randomly selected from a spoken corpus comprising speech samples of
an institutional oral English test. From the transcriptions of these speech samples, collocation strings of
10 syntactic patterns were manually extracted. These strings were then coded by six trained native-
English-speaking linguists for accuracy, semantic transparency, and frequency and by two non-native
linguists for difficulty. Based on this human coding, collocations produced by test takers of different
oral proficiency levels were compared. Further, collocation features deriving from the human coding
(e.g., difficult collocations attempted per 100 words, ratio of accurate collocations) were included in a
regression model to predict the speakers’ holistic test scores given by human raters according to a
comprehensive rating rubric covering pronunciation, vocabulary, and fluency.
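The final step described above can be sketched as an ordinary regression of human holistic scores on collocation-derived predictors; the feature and file names are hypothetical stand-ins for the features mentioned in the abstract.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-speaker feature table derived from the human coding.
feats = pd.read_csv("collocation_features.csv")

model = smf.ols(
    "holistic_score ~ difficult_colloc_per_100w + accurate_colloc_ratio",
    data=feats,
).fit()
print(model.summary())   # R-squared indicates the features' predictive power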
Notes
Paper
Assessing Lower-order and Higher-order Listening Skills for ESL Students
Tatiana Nekrasova-Becker
Second Language Testing, Inc.
Anthony Becker
Second Language Testing, Inc.
Designing second-language (L2) tests to measure students' listening comprehension can be quite
challenging, as L2 teachers and/or test developers must consider the assessment of various lower-order
(e.g., identifying main ideas) and higher-order skills (e.g., inferencing or interpreting speakers’
attitudes). Oftentimes, L2 listening tests are limited to measuring lower-order skills, as these are
generally easier and more familiar to assess. However, neglecting higher-order skills, which are essential
for successful listening, can result in a narrowing of the listening construct, as it leaves out important
linguistic devices that speakers use to convey meaning to listeners (Buck, 2001; Wagner, 2004). In order
to confidently measure students’ listening comprehension, L2 listening tests need to maintain a balance
between the different types of lower-order and higher-order skills.
This empirical study investigated the performance of 87 examinees studying English as a second
language at an Intensive English Program (IEP) in the United States. Based on their responses to 30
listening items used in an IEP placement test, examinees’ performance on items measuring lower- and
higher-order listening skills was analyzed across two proficiency groups. Specifically, the examinees’
scores for items measuring lower- and higher-order listening skills were compared for the two
proficiency groups and then correlated with their overall listening scores. The results indicated that,
while participants’ performance on both types of items (i.e., lower- and higher-order listening skills)
successfully distinguished between the two proficiency groups, the scores in both groups for items
measuring higher-order listening skills were more closely related (than scores for lower-order listening
skills) to the overall listening scores. The implications of these findings can help to raise awareness for
including both lower-order and higher-order skills in L2 listening tests, resulting in more informed
decisions that can lead to better assessment of L2 students’ listening comprehension.
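A minimal sketch of the comparison described above, assuming hypothetical long-format data with one row per examinee-item response:

import pandas as pd

df = pd.read_csv("listening_responses.csv")   # hypothetical file

# Sum each examinee's scores separately for lower- and higher-order items.
subscores = (
    df.pivot_table(index=["examinee", "group", "overall"],
                   columns="skill", values="score", aggfunc="sum")
      .reset_index()
)

# Correlate each subscore with the overall listening score, per group.
for g, sub in subscores.groupby("group"):
    print(g, sub[["lower", "higher"]].corrwith(sub["overall"]).round(2))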
Notes
Paper
Building Evidence for the Evaluation of English Learners’ Writing Scores
Anthony Becker
Second Language Testing, Inc.
Performance-based assessment (PBA) has become the primary means for assessing L2 learners’ writing
abilities. Despite its popularity in L2 writing, validation is a contentious issue in PBA, as scoring rubrics
are rarely entered as evidence for the appropriateness of decisions made from test scores (Cumming et
al., 2004; Leung & Lewkowicz, 2006). In an argument-based approach to validity, the investigation of
scoring rubrics is an important aspect of a major inference, evaluation. Evidence that supports the
evaluation inference is crucial in the validation process, since decisions will likely be less valid if
scoring rubrics are not adequately constructed and appropriately used (Kane, 2006). To provide
evidence for the evaluation inference, five assumptions should be supported: a) rubrics capture relevant
aspects of performance at different score levels; b) attempts were made to standardize scoring
procedures; c) raters know how to implement rubrics; d) experienced and novice raters score students’
writing similarly; and e) rubrics function appropriately with their existing scales.
This research study, conducted in 2009-2010, had a two-fold purpose: 1) to investigate the
quality of scoring rubrics used to assess L2 students’ writing ability at four Intensive English Programs
(IEPs) and 2) to determine the value of a proposed framework for examining the evaluation inference.
The study incorporated a multiple case-study methodology, whereby quantitative and qualitative
evidence was collected. The results indicated that rubrics captured relevant aspects of writing
performance and that the rating scales used in the rubrics functioned appropriately to distinguish
different levels of performance, despite the finding that teacher-raters could have benefitted from
additional scorer training. Also, the results of the investigation of the evaluation inference demonstrated
that IEP administrators perceived the framework as being coherent and useful, but that it could have
been more adequate and implementable.
Notes
Paper
Textual Borrowing and Rater Perceptions in Integrated Writing Tasks
Sara Cushing Weigle
Georgia State University
Megan Montee
Georgia State University
Integrated assessment tasks are intended to more closely reflect language use in real-world academic
settings than tasks that measure only one skill. In the case of integrated reading and writing tasks, test
takers base their writing on one or more input reading texts. While integrated tasks offer important
benefits in terms of task authenticity, their use also raises issues about how test takers incorporate the
ideas and language from source texts into their writing. Textual borrowing refers to the direct use of
language from the source text. While previous research (Cumming et al., 2005; Weigle & Parker, 2010)
has looked at patterns of textual borrowing in writing assessment, no research to date has examined how
test raters perceive source-based writing and how textual borrowing may affect their ratings. Textual
borrowing may lead to inaccurate ratings by masking students’ writing proficiency (Weigle, 2002). In
addition, issues of what raters perceive as appropriate and inappropriate source-based writing, and the
extent to which these perceptions reflect the expectations of real-world academic writing, are essential to
task authenticity. In the context of second language writing and writing assessment, source-based
writing also raises questions of cultural differences in how writers use sources and attribute source
material.
This paper presents the results of an exploratory study of how test raters identify, perceive and make
scoring decisions about textual borrowing. The context of the study is a locally developed writing exam
used for placement in English as a Second Language courses. Data collection included focus groups and
stimulated recalls with test raters. Raters also completed a judgment task about the acceptability of
instances of textual borrowing in essay exams. This paper presents results from the study focusing on
implications for rubric design and rater training.
Notes
Paper
The Development of Specifications for an Oral Proficiency Test:
Contributions to Validity Claims
Francesca Di Silvio
Center for Applied Linguistics
Anne Donovan
Center for Applied Linguistics
Beth A. Mackey
Visiting Scholar, Center for Applied Linguistics
Detailed test specifications contribute to a validity argument by documenting what the test purports to
measure (Bachman & Palmer, 1996; Davidson & Lynch, 2002; Hughes, 1991). Bearing in mind the
audience of potential test users and test takers in specification design ensures consideration of issues of
consequential validity during the development process. Documented test specifications also facilitate
development of additional test items and test forms that are consistent in characteristics. Furthermore, an
iterative process of specification development based on feedback from internal review, field tests, and
operational use provides a clear record of evidence for validity claims (Bachman & Palmer, 1996;
Fulcher, 2003).
This paper describes the development of specifications for a computer-delivered oral proficiency test for
language learners of high school age and above that is aligned to the ACTFL Proficiency Guidelines—
Speaking. Specifications were designed based on operational tests in two languages that have been
shown to elicit results comparable to a preceding tape-mediated test of oral proficiency (Kenyon &
Malabonga, 2001) that demonstrated high correlations with the ACTFL OPI (Stansfield, 1990). The
purpose of the project was to write and pilot specifications to facilitate efficacious development of tests
in additional languages, with a particular focus on the item writing process which has been infrequently
addressed in the literature (Kim et al., 2010).
The paper presents methodology used to develop specifications consistent with Popham’s test
specification model (1978), including procedures for task writing, review, and banking. Presenters will
emphasize how the iterative process of test specification development contributes to validity claims.
Presenters will also discuss the issue of specificity in view of lessons learned during task writing and
implications for operational development. The iterative methodology presented may be replicated by
other language testers to build a validity argument as well as a foundation to support efficient test
development.
Notes
Paper
Assessing Learning Outcomes in Short-term Foreign Language Programs:
Validation Results of a Triangulated Assessment System
Megan C. Masters
University of Maryland
Steven J. Ross
University of Maryland
Margaret E. Malone
Center for Applied Linguistics
There are currently no nationally recognized, standardized assessment tools available, particularly in critical
languages, to document the language learning gains for beginning-level language learners participating in
short-term programs. To assess the effectiveness of short-term foreign language programs, reliable and valid
measures of learning outcomes are needed, especially since commonly used assessment tools, such as the
Oral Proficiency Interview (OPI), are not granular enough to document learner progress after only a few
weeks of study.
The current paper describes the results of a national, two-year study (N=700) of three different assessment
tools piloted for use to inform the development of a triangulated assessment system. The purpose of the
assessment system was to examine the reliability, consistency across language programs and convergent
validity of three instruments developed as indicators of learning outcomes. Participants were students in
grades 9-12 who were enrolled in short-term Arabic, Chinese and Hindi language programs. The programs
were designed to initiate and sustain interest in language study and to foster proficiency in strategically
important languages. The three assessment tools--the first primarily objective and the remaining subjective--
included: (1) a computerized proficiency test of four language skills, (2) a student self-assessment and (3) a
teacher assessment of student performance. This paper will explore the results of the pilot assessment
system, including the correlations between the three assessments on measuring student proficiency as well as
the outcomes of Rasch rating scale/mixed-scale modeling, which allows for direct comparison across the
three assessment instruments. The implications of these results will be examined in light of their provision of
empirically grounded evidence of student performance and program quality. Results will include an analysis
of cross-program comparability of outcomes, the evaluation of the impact of training on the alignment of the
triangulated assessment system and the identification of criteria to be considered for diagnostic feedback to
program directors and students on their use of the assessment instruments.
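As a rough sketch of the convergent-validity piece of such a system, assuming a hypothetical per-student summary file and column names, rank correlations among the three instruments can be inspected per language program:

import pandas as pd

df = pd.read_csv("triangulated_scores.csv")   # hypothetical columns below
cols = ["proficiency_test", "self_assessment", "teacher_rating"]

# Convergent validity: rank correlations among the three instruments,
# computed within each language program for cross-program comparison.
for lang, sub in df.groupby("language"):
    print(lang)
    print(sub[cols].corr(method="spearman").round(2))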
Notes
Paper
Investigating the Construct Validity of the Grammar Sub-Test of the CEP
Placement Exam
Payman Vafaee
Teachers College, Columbia University
Nesrine Basheer
University of Maryland
Reese Heitner
Teachers College, Columbia University
An important assumption in language testing is that test items or observable variables correspond to the
structural relations hypothesized in the theoretical model or constructs governing the design of the
testing instrument (e.g., Shin, 2005). Accordingly, the purpose of the present construct validity study
was to investigate the extent to which scores from the grammar sub-test of the Columbia University
Community English Program (CEP) placement test could be interpreted as indicators of test takers’
grammatical knowledge. In the current study, we adopted Purpura’s (2004) model, which hypothesizes
that grammatical knowledge consists of two underlying factors of form and meaning. To this end, we
conducted a confirmatory factor analysis to investigate if the data from this test could fit this theoretical
model. In addition, since the test items were not discrete-point but were nested within one of four tasks
(each with its own theme), the interaction effects of these four themes on individual items were
also investigated. The data for this study was collected from the administration of the test to 144
participants. In preparing the data for CFA, descriptive statistics were examined, and reliability
analysis and exploratory factor analysis were conducted. Upon examining several CFA models, a
multi-trait multi-method model achieved the best possible model-fit in accordance with substantive
considerations and issues of parsimony. In conclusion, this full-latent model, which included two trait
factors of grammatical form and meaning and four method factors, confirmed that the CEP test
examined the grammatical knowledge proposed in the hypothesized theoretical model. These findings
contribute to recent discussion concerning the importance of both construct definitions and method
effect in testing L2 grammatical knowledge.
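One way to specify a multi-trait multi-method CFA of this general shape in Python is with the semopy package (lavaan in R is a more common choice); the item-to-factor assignments below are invented for illustration and do not reflect the actual CEP test structure.

import pandas as pd
from semopy import Model, calc_stats

# Hypothetical MTMM specification: each item loads on one trait factor
# (form or meaning) and on the method factor for its task theme.
desc = """
form    =~ i1 + i3 + i5 + i7
meaning =~ i2 + i4 + i6 + i8
theme1  =~ i1 + i2
theme2  =~ i3 + i4
theme3  =~ i5 + i6
theme4  =~ i7 + i8
form ~~ meaning
"""

data = pd.read_csv("cep_grammar_items.csv")   # hypothetical item scores
model = Model(desc)
model.fit(data)
print(calc_stats(model))   # CFI, RMSEA, etc., for judging model fit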
Notes
Poster abstracts listed in alphabetical order
Poster
Investigating Assessment Literacy through Teacher Professional Development
Aileen Bach
Center for Applied Linguistics
Anne Donovan
Center for Applied Linguistics
This poster will present findings of a study of assessment literacy of teachers of less commonly taught
languages; it describes the changes in instructor knowledge during a blended learning assessment course
that combines online and face-to-face formats. Assessment literacy is defined as the knowledge and
skills teachers need to accurately and effectively plan for and administer assessments and interpret and
apply the results (Boyles 2005; Taylor, 2009). Effective classroom assessment is an important
component of today’s foreign language classroom, because it provides insight for teachers on how to
improve teaching and learning in their classroom (Shepard, 2000). The nine-week course described in
this poster was designed for in-service teachers of less commonly taught languages. These teachers
usually work with limited teaching resources and often have limited training in pedagogy (Wan, 2009),
thus creating a great need for assessment literacy in this population.
This study provides insight into how to cultivate assessment literacy by tracking the progress and depth
of understanding of course participants over time by analyzing data from various sources. Data were
collected through pre- and post-course surveys, assessment tasks developed during the course,
discussion board posts, and observations. These data also show what LCTL teachers know about
assessment, how their knowledge emerges through assessment practices, and what opportunities exist
for providing teacher professional development in order to strengthen their use of assessments. Since
this study explores the evolution of LCTL teachers’ assessment knowledge, beliefs, and practices, it
provides an important and often-overlooked foundation for the development of assessment literacy
training.
References
Boyles, P. (2005). Assessment literacy. In M. Rosenbusch (Ed.), National Assessment Summit Papers
(pp. 11-15). Ames, IA: Iowa State University.
Shepard, L. (2000). The role of assessment in a learning culture. Educational Researcher, 29(7), 4-14.
Taylor, L. (2009). Developing assessment literacy. Annual Review of Applied Linguistics, 29, 21–36.
Wan, S. (2009). Preparing and supporting teachers of Less Commonly Taught Languages. Modern
Language Journal, 93(2), 282-287.
Notes
Poster
The Design of an Oral Test for Foreign Language Teachers:
Validating the Vocabulary Descriptors
Melissa Alves Baffi-Bonvino
Universidade Estadual Paulista, Brazil
This poster will report on the results of a six-year research study about the process of assessing
vocabulary oral production in English as a Foreign Language (EFL) of undergraduate students in a pre-
service teacher education course at a public university in Brazil, who were preparing to enter the field of
English language teaching (ELT). It aimed to analyze students’ oral proficiency considering vocabulary
use in order to contribute to the validation of the vocabulary descriptors. The research is interpretive
and data analysis was developed by means of qualitative and quantitative procedures. Lexical
proficiency was assessed by means of an oral test designed as a pilot test (TEPOLI) and mock speaking
tests of the high stakes exams FCE and IELTS, whose approach to speaking is grounded in Bachman’s
(1990), Canale and Swain’s (1980) and Canale’s (1983) communicative competence models (Galaczi &
Khalifa, 2009). Individual interviews and questionnaires were also used to collect data, so that the
participants' perceptions could be considered within a broader view of the context. The analysis of
language data produced during the tests was carried out with the RANGE program (Victoria University
of Wellington) and with statistical analysis. The overall results indicate similarities in the levels of
proficiency obtained when data were compared and show that the connections between vocabulary and
oral proficiency can contribute to validate the vocabulary descriptors of TEPOLI by means of more
objective criteria to assess lexical oral proficiency. The study was carried out within a larger research
project about the process of validating a language proficiency examination for foreign language
teachers, grounded in the need to intervene in the Brazilian context of language teacher education to
establish clear criteria to assess language proficiency, in the language domain required for those
teachers.
Notes
Poster
Withdrawing from the Bank: Item Stability across Multiple Testing Contexts
Martyn Clark
Center for Applied Second Language Studies
One of the advantages of a Rasch-calibrated item bank is that it allows the creation of a wide variety of
tailored tests geared towards particular objectives yet still related to each other (Wright & Stone, 1999).
The Chinese Computerized Assessment of Proficiency (CAP) is a web-based, low-stakes test of Chinese
proficiency developed to provide useful feedback to teachers and language programs. Items for the
Chinese CAP were developed and calibrated using Rasch analysis of pilot test data (N=600) and expert
review. Two new versions of the test were created for different contexts using a subset of these
calibrated items.
This presentation will briefly discuss the selection and calibration of items in the CAP item bank, and
then present a comparison with the results of two alternate "forms" of the test drawn from the same
calibrated bank. In the first instance, a shortened version of the test was created for beginning students
studying Chinese in summer StarTalk programs (N=249) as part of a triangulation study between student
self-assessment, teacher retrospective assessment of student ability, and CAP scores. For this version of
CAP, items were chosen that targeted lower proficiency levels and matched the self-assessment "Can
Do" statements common to the other instruments in the study. In the second instance, a subset of items
geared towards intermediate-level proficiency was chosen to provide a pre/post measure of language
ability for a small group of students (N=19) participating in a summer study abroad program in China.
Item difficulty stability across these testing contexts is investigated through comparison of independent
Rasch item calibrations from each context with the original item bank values. Given that available
options for testing less commonly taught languages are relatively scarce, this presentation
should be of interest to others facing similar situations.
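The stability check itself can be sketched simply, assuming item difficulties (in logits) have already been estimated independently in each context; the values below are invented.

import numpy as np

bank = np.array([-1.2, -0.4, 0.1, 0.6, 1.3, 1.9])   # original bank values
new = np.array([-1.0, -0.5, 0.3, 0.4, 1.5, 2.1])    # independent recalibration

# Place the new calibration on the bank scale with a mean-mean link,
# then inspect per-item displacement (e.g., flag items beyond ~0.5 logits).
linked = new - (new.mean() - bank.mean())
displacement = linked - bank
print("correlation:", round(float(np.corrcoef(bank, linked)[0, 1]), 3))
print("displacement:", displacement.round(2))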
Notes
Poster
Bridging the Gap: How Language Testers can Build Assessment Literacy in
Practitioners of Other Fields
Anne Donovan
Center for Applied Linguistics
Margaret E. Malone
Center for Applied Linguistics
Francesca Di Silvio
Center for Applied Linguistics
Megan Montee
Georgia State University
Research on language assessment literacy, or the knowledge users need to have about language
assessment to make informed decisions (Inbar-Lourie, 2008; Taylor, 2009), has focused primarily on
instructional contexts and the needs of language teachers to conduct fair and valid assessments of their
students. However, language assessment literacy is important across a variety of contexts beyond
traditional language testing courses conducted for preservice teachers. This poster presents the initial
results of a multi-year project that examines the assessment literacy needs of two groups: Second
Language Acquisition (SLA) researchers and professionals who work in Language Teacher Education
(LTE). The project investigates the current assessment knowledge base of both groups to identify what
resources should be developed to meet the needs of these audiences. For the SLA context, the project
focuses on how collaboration can be facilitated between the fields of SLA and language testing in order
to support SLA researchers’ use of language tests in research contexts. For the LTE context, the project
focuses on the essential knowledge of assessment needed to provide appropriate pre- and in-service
professional development for language instructors. The project will develop assessment literacy
resources for each group.
This poster presents results from four focus groups conducted with SLA and LTE researchers and
practitioners. Focus groups were recorded, transcribed, and analyzed for major themes related to current
practices, challenges, and resources needed in language testing. The results of the focus groups will
inform the development of future phases of the project, including surveys of both groups’ assessment
needs. In addition to presenting the focus group results, the poster will discuss implications for
collaboration between language testers and other language professionals, and will raise questions about
the responsibility of the language testing community to make testing research accessible to colleagues in
related fields.
Notes
Poster
Transcription as a Language Testing Tool
Beth A. Mackey
U.S. Department of Defense
Foreign language specialists in the government find themselves constantly tested both formally and
informally, in the classroom and for professional certification. Informal language tests are created by
teachers for the pedagogical purposes of providing feedback to teacher and student, reaffirming and
measuring what learning has occurred and motivating students to retain and expand their language skills.
Federal agencies develop formal tests to assess foreign language proficiency (primarily in speaking as
well as in reading and listening comprehension) as a basis for making employment and career
development decisions. Government agencies also use task-based testing formats such as translation and
transcription (both verbatim and listening summaries) to measure the specific application of language
skills in a job-related context. While the use of translation as a measure of language proficiency has been
addressed in the literature (e.g., Buck, 1992), research on verbatim transcription in the foreign language
testing arena has been absent. This study (n=197) draws upon data collected by the Spanish Department
of a government language school. The dataset includes subscores in translation, transcription, and cloze.
Standardized tests of reading and listening proficiency are also available, allowing for a more thorough
exploration of transcription, translation, and their relationship with listening and reading skills. This
poster will explore transcription as a measure of listening comprehension, including correlation and
regression tables from the Spanish test results and qualitative input from an earlier survey of
transcription tests. The lack of attention in the testing literature suggests that this topic deserves further
investigation.
Notes
Poster
Task Complexity Features and Speaking Test Performance
Megan Montee
Georgia State University
Task complexity has become increasingly important in both Second Language Acquisition (SLA) and
language testing research, motivated in large part by the importance of language tasks for classroom
teaching and learning (Kim, 2009). However, the applications of SLA task-based research for language
testing are still unclear (Robinson, 2011) and there is a need for additional analysis of the relationship
between task features and language output in performance assessment tasks.
This poster presents the results of an exploratory study of task features and linguistic performance,
operationalized in terms of complexity, accuracy and fluency (CAF) measures. Data from the study
comes from a small corpus of transcribed responses to tasks from the WIDA ACCESS for ELLs
Speaking Test, an English language proficiency test used in U.S. public schools. The eight picture-based
tasks included in this study are intended to assess academic language use. However, there has been no
published research to date that systematically examines the relationship between the task features and
linguistic performance across various proficiency levels on the test. To explore this issue, the data
analysis for this study included coding each task for complexity features according to Robinson’s (2007)
task complexity framework. Student responses were then coded for multiple CAF measures, and these
results were compared with the task features as well as the scoring specifications for each task.
Results of this exploratory study have several implications. First, test developers may find these results
suggestive of the ways task features can be altered to elicit variations in examinee performance. Next,
the results indicate future directions for task-based research. Based on the results of the exploratory
analysis, this poster will present a proposed program of research to further explore task characteristics
and language production on the ACCESS Speaking Test.
Notes
Poster
Best Practices in Pilot Testing a High-stakes English Language Proficiency Test
Abbe Spokane
Center for Applied Linguistics
Tiffany Yanosky
Center for Applied Linguistics
Pilot testing—pre-operational research that is often small-scale and qualitative, and is intended to
identify necessary revisions in items—is a widespread practice but has little documentation in the literature.
This poster aims to document and disseminate pilot testing methods for a high-stakes, large-scale
assessment of English language proficiency using the framework of Bachman’s (2005) Assessment Use
Argument (AUA). The poster will address several research questions: Why is pilot testing important for
language test development? What pilot testing methods do the presenters use? What defines these
methods as best practices? How do theoretical foundations such as Bachman’s AUA structure and
support pilot testing practices? The AUA, as applied to language testing, "is an overall logical
framework for linking assessment performance to use (decisions)" (Bachman, 2005, p. 1). In pre-
operational testing, test developers claim that performance on new test items is based on the ability of
test takers rather than a construct-irrelevant variable and that the items should become operational. Data
collected during pilot testing consists of test takers’ responses to and feedback about items and serves as
backing to the original claim or to various rebuttals of that claim. The quality of the backing for the
claim and rebuttals leads to decisions about inclusion of the new test items on the operational test and
how the scores on those items should be interpreted and used to take actions related to test takers, such
as placement in programs or classes (Kenyon & MacGregor, in press). Adopting common terminology
for pre-operational practices, supporting pilot testing methods with theoretical models, and sharing best
practices would serve to enhance test validity arguments. We believe sharing our procedures will
encourage other test developers to do the same and will increase the use and quality of pilot testing and
produce more ethical and valid language tests.
Notes
Poster
Implementing Cognitive Diagnostic Assessment in an Institutional Test through
Collaboration of Language Testers: A New Networking Model in Language Testing
Yeon-Sook Yi
University of Illinois at Urbana-Champaign
Stephanie Gaillard
University of Illinois at Urbana-Champaign
Cognitive diagnostic assessment (CDA) has gained attention in language testing since the late 1990s,
yielding encouraging results in general. However, empirical studies to date have used large-scale,
standardized tests. We attempt to expand this previous context of CDA by applying it to a college French
placement test, which will be the first application of CDA to institutional-level language test development.
In order to do so, we report on a unique collaboration of researchers. A graduate student specializing in
measurement/language testing brings theoretical and empirical knowledge of CDA to the project, while
another student majoring in French/language testing provides linguistic expertise from the initial phase
of identifying attributes through the iterative process of refining the Q-matrix and final score reporting.
The ultimate results of this collaboration will be used to assign students to appropriate course levels,
and the fine-grained feedback about students’ performance will be utilized in classroom teaching. More
French teachers and second language learners are also involved in the project: graduate
teaching assistants of French will participate in identifying language attributes and constructing a Q-
matrix and learners of French will contribute to specifying attributes of the test items through verbal
protocol reports.
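For readers unfamiliar with the Q-matrix that the collaboration centers on, it is simply a binary item-by-attribute matrix; the attributes and loadings below are invented for illustration, not the project's actual specification.

import pandas as pd

# Hypothetical Q-matrix: rows are test items, columns are skill attributes;
# a 1 means the item requires that attribute. A CDA model (e.g., DINA) uses
# this matrix to turn item responses into per-attribute mastery feedback.
q_matrix = pd.DataFrame(
    {
        "verb_morphology":    [1, 0, 1, 0, 0],
        "core_vocabulary":    [0, 1, 1, 0, 1],
        "listening_decoding": [0, 0, 0, 1, 1],
    },
    index=[f"item_{i}" for i in range(1, 6)],
)
print(q_matrix)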
We foresee that this innovative collaboration model can extend to other foreign language testing at our
university, and we hope that positive results from this research will establish important empirical
evidence that different types of language tests in different settings can also benefit from the strengths of
this new testing approach. We also expect this collaboration model to help make CDA, a method often
deemed a technically challenging multi-step procedure used only by a limited number of interested
groups, more accessible to a wider public of language teachers and testers.
Notes
What is the East Coast Organization of Language Testers?
The East Coast Organization of Language Testers (ECOLT) represents an East Coast
group of professionals, scholars, and students who are involved in language testing
projects and research. One of the organization’s goals is to support connections between
academia, government, and testing organizations. In addition to providing a forum for
continued learning and networking, ECOLT strongly supports the work of graduate
students.
For more information about ECOLT, contact:
Dr. Margaret E. (Meg) Malone, Center for Applied Linguistics at: [email protected]
Program printed courtesy of Second Language Testing, Inc.