
Quantitative Data Analysis for Language Assessment Volume I

Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques is a resource book that presents the most fundamental techniques of quantitative data analysis in the field of language assessment. Each chapter provides an accessible explanation of the selected technique, a review of language assessment studies that have used the technique, and finally, an example of an authentic study that uses the technique. Readers also get a taste of how to apply each technique through the help of supplementary online resources that include sample datasets and guided instructions. Language assessment students, test designers, and researchers should find this a unique reference, as it consolidates theory and application of quantitative data analysis in language assessment.

Vahid Aryadoust is an Assistant Professor of language assessment literacy at the National Institute of Education of Nanyang Technological University, Singapore. He has led a number of language assessment research projects funded by, for example, the Ministry of Education (Singapore), Michigan Language Assessment (USA), Pearson Education (UK), and Paragon Testing Enterprises (Canada), and published his research in, for example, Language Testing, Language Assessment Quarterly, Assessing Writing, Educational Assessment, Educational Psychology, and Computer Assisted Language Learning. He has also (co)authored a number of book chapters and books that have been published by Routledge, Cambridge University Press, Springer, Cambridge Scholar Publishing, Wiley Blackwell, etc. He is a member of the Advisory Board of multiple international journals including Language Testing, Language Assessment Quarterly, Educational Assessment, Educational Psychology, and Asia Pacific Journal of Education. In addition, he has been awarded the Intercontinental Academia Fellowship (2018–2019) which is an advanced research program launched by the University-Based Institutes for Advanced Studies. Vahid’s areas of interest include theory-building and quantitative data analysis in language assessment, neuroimaging in language comprehension, and eye-tracking research.

Michelle Raquel is a Senior Lecturer at the Centre of Applied English Studies, University of Hong Kong, where she teaches language testing and assessment to postgraduate students. She has extensive assessment development and management experience in the Hong Kong education and government sector. In particular, she has either led or been part of a group that designed and administered large-scale computer-based language proficiency and diagnostic assessments such as the Diagnostic English Language Tracking Assessment (DELTA). She specializes in data analysis, specifically Rasch measurement, and has published several articles in international journals on this topic as well as academic English, diagnostic assessment, dynamic assessment of English second-language dramatic skills, and English for specific purposes (ESP) testing. Michelle’s research areas are classroom-based assessment, diagnostic assessment, and workplace assessment.


Routledge Research in Language Education

The Routledge Research in Language Education series provides a platform for established and emerging scholars to present their latest research and discuss key issues in Language Education. This series welcomes books on all areas of language teaching and learning, including but not limited to language education policy and politics, multilingualism, literacy, L1, L2 or foreign language acquisition, curriculum, classroom practice, pedagogy, teaching materials, and language teacher education and development. Books in the series are not limited to the discussion of the teaching and learning of English only.

Books in the series include

Interdisciplinary Research Approaches to Multilingual Education
Edited by Vasilia Kourtis-Kazoullis, Themistoklis Aravossitas, Eleni Skourtou and Peter Pericles Trifonas

From Language Skills to Literacy: Broadening the Scope of English Language Education Through Media Literacy
Csilla Weninger

Addressing Difficult Situations in Foreign-Language Learning: Confusion, Impoliteness, and Hostility
Gerrard Mugford

Translanguaging in EFL Contexts: A Call for Change
Michael Rabbidge

Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques
Edited by Vahid Aryadoust and Michelle Raquel

For more information about the series, please visit www.routledge.com/Routledge-Research-in-Language-Education/book-series/RRLE


Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques

Edited by Vahid Aryadoust and Michelle Raquel


First published 2019 by Routledge 2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN

and by Routledge 52 Vanderbilt Avenue, New York, NY 10017

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2019 selection and editorial matter, Vahid Aryadoust and Michelle Raquel; individual chapters, the contributors

The right of Vahid Aryadoust and Michelle Raquel to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book has been requested

ISBN: 978-1-138-73312-1 (hbk)
ISBN: 978-1-315-18781-5 (ebk)

Typeset in Galliard by Apex CoVantage, LLC

Visit the eResources: www.routledge.com/9781138733121


Contents

List of figures
List of tables
Preface
Editor and contributor biographies

Introduction
VAHID ARYADOUST AND MICHELLE RAQUEL

SECTION I Test development, reliability, and generalizability

1 Item analysis in language assessment
RITA GREEN

2 Univariate generalizability theory in language assessment
YASUYO SAWAKI AND XIAOMING XI

3 Multivariate generalizability theory in language assessment
KIRBY C. GRABOWSKI AND RONGCHAN LIN

SECTION II Unidimensional Rasch measurement

4 Applying Rasch measurement in language assessment: unidimensionality and local independence
JASON FAN AND TREVOR BOND

5 The Rasch measurement approach to differential item functioning (DIF) analysis in language assessment research
MICHELLE RAQUEL

6 Application of the rating scale model and the partial credit model in language assessment research
IKKYU CHOI

7 Many-facet Rasch measurement: implications for rater-mediated language assessment
THOMAS ECKES

SECTION III Univariate and multivariate statistical analysis

8 Analysis of differences between groups: the t-test and the analysis of variance (ANOVA) in language assessment
TUĞBA ELIF TOPRAK

9 Application of ANCOVA and MANCOVA in language assessment research
ZHI LI AND MICHELLE Y. CHEN

10 Application of linear regression in language assessment
DAERYONG SEO AND HUSEIN TAHERBHAI

11 Application of exploratory factor analysis in language assessment
LIMEI ZHANG AND WENSHU LUO

Index


Figures

1.1 Facility values and distracter analysis
1.2 Discrimination indices
1.3 Facility values, discrimination, and internal consistency (reliability)
1.4 Task statistics
1.5 Distracter problems
2.1 A one-facet crossed design example
2.2 A two-facet crossed design example
2.3 A two-facet partially nested design example
3.1 Observed-score variance as conceptualized through CTT
3.2 Observed-score variance as conceptualized through G theory
4.1 Wright map presenting item and person measures
4.2 Standardized residual first contrast plot
5.1 ICC of an item with uniform DIF
5.2 ICC of an item with non-uniform DIF
5.3 Standardized residual plot of 1st contrast
5.4 ETS DIF categorization of DIF items based on DIF size and statistical significance
5.5 Sample ICC of item with uniform DIF (positive DIF contrast)
5.6 Sample ICC of item with uniform DIF (negative DIF contrast)
5.7 Macau high-ability students (M2) vs. HK high-ability students (H2) sample ICCs of an item with NUDIF (positive DIF contrast)
5.8 Macau high-ability students (M2) vs. HK high-ability students (H2) sample ICCs of an item with NUDIF (negative DIF contrast)
5.9 Plot diagram of person measures with vs. without DIF items
6.1 Illustration of the RSM assumption
6.2 Distributions of item responses
6.3 Estimated response probabilities for Items 1, 2, and 3 from the RSM (dotted lines) and the PCM (solid lines)
6.4 Estimated standard errors for person parameters and test information from the RSM (dotted lines) and the PCM (solid lines)
6.5 Estimated response probabilities for Items 6, 7 and 8 from the RSM (dotted lines) and the PCM (solid lines), with observed response proportions (unfilled circles)
7.1 The basic structure of rater-mediated assessments
7.2 Fictitious dichotomous data: Responses of seven test takers to five items scored as correct (1) or incorrect (0)
7.3 Illustration of a two-facet dichotomous Rasch model (log odds form)
7.4 Fictitious polytomous data: Responses of seven test takers evaluated by three raters on five criteria using a five-category rating scale
7.5 Illustration of a three-facet rating scale measurement model (log odds form)
7.6 Studying facet interrelations within a MFRM framework
7.7 Wright map for the three-facet rating scale analysis of the sample data (FACETS output, Table 6.0: All facet vertical “rulers”)
7.8 Illustration of the MFRM score adjustment procedure
9.1 An example of boxplots
9.2 Temporal distribution of ANCOVA/MANCOVA-based publications in four language assessment journals
9.3 A matrix of scatter plots
10.1 Plot of regression line graphed on a two-dimensional chart representing X and Y axes
10.2 Plot of residuals vs. predicted Y scores where the assumption of linearity holds for the distribution of random errors
10.3 Plot of residuals vs. predicted Y scores where the assumption of linearity does not hold
10.4 Plot of standardized residuals vs. predicted values of the dependent variable that depicts a violation of homoscedasticity
10.5 Histogram of residuals
10.6 Plot of predicted values vs. residuals
11.1 Steps in running EFA
11.2 Scatter plots to illustrate relationships between variables
11.3 Scree plot for the ReTSUQ data


Tables

2.1 Key Steps for Conducting a G Theory Analysis
2.2 Data Setup for the p × i Study Example With 30 Items (n = 35)
2.3 Expected Mean Square (EMS) Equations (the p × i Study Design)
2.4 G-study Results (the p × i Study Design)
2.5 D-study Results (the p × I Study Design)
2.6 Rating Design for the Sample Application
2.7 G- and D-study Variance Component Estimates for the p × r′ Design (Rating Method)
2.8 G- and D-study Variance Component Estimates for the p × r Design (Subdividing Method)
3.1 Areas of Investigation and Associated Research Questions
3.2 Research Questions and Relevant Output to Examine
3.3 Variance Component Estimates for the Four Subscales (p• × T• × R• Design; 2 Tasks and 2 Raters)
3.4 Variance and Covariance Component Estimates for the Four Subscales (p• × T• × R• Design; 2 Tasks and 2 Raters)
3.5 G Coefficients for the Four Subscales (p• × T• × R• Design)
3.6 Universe-Score Correlations Between the Four Subscales (p• × T• × R• Design)
3.7 Effective Weights of Each Subscale to the Composite Universe-Score Variance (p• × T• × R• Design)
3.8 Generalizability Coefficients for the Subscales When Varying the Number of Tasks (p• × T• × R• Design)
4.1 Structure of the FET Listening Test
4.2 Summary Statistics for the Rasch Analysis
4.3 Rasch Item Measures and Fit Statistics (N = 106)
4.4 Standardized Residual Variance
4.5 Largest Standardized Residual Correlations
5.1 Selected Language-Related DIF Studies
5.2 Listening Sub-skills in the DELTA Listening Test
5.3 Rasch Analysis Summary Statistics (N = 2,524)
5.4 Principal Component Analysis of Residuals
5.5 Approximate Relationships Between the Person Measures in PCAR Analysis
5.6 Items With Uniform DIF
5.7 Number of NUDIF Items
5.8 Texts Identified by Expert Panel to Potentially Disadvantage Macau Students
6.1 Item Threshold Parameter Estimates (With Standard Errors in the Parentheses) From the RSM and the PCM
6.2 Person Parameter Estimates and Test Information From the RSM and the PCM
6.3 Infit and Outfit Mean Square Values From the RSM and the PCM
7.1 Excerpt From the FACETS Rater Measurement Report
7.2 Excerpt From the FACETS Test Taker Measurement Report
7.3 Excerpt From the FACETS Criterion Measurement Report
7.4 Separation Statistics and Facet-Specific Interpretations
8.1 Application of the t-Test in the Field of Language Assessment
8.2 Application of ANOVA in Language Testing and Assessment
8.3 Descriptive Statistics for Groups’ Performances on the Reading Test
9.1 Summary of Assumptions of ANCOVA and MANCOVA
9.2 Descriptive Statistics of the Selected Sample
9.3 Descriptive Statistics of Overall Reading Performance and Attitude by Sex-Group
9.4 ANCOVA Summary Table
9.5 Estimated Marginal Means
9.6 Descriptive Statistics of Reading Performance and Attitude by Sex-Group
9.7 MANCOVA Summary Table
9.8 Summary Table for ANCOVAs of Each Reading Subscale
10.1 Internal and External Factors Affecting ELLs’ Language Proficiency
10.2 Correlation Matrix of the Dependent Variable and the Independent Variables
10.3 Summary of Stepwise Selection
10.4 Analysis of Variance Output for Regression, Including All the Variables
10.5 Parameter Estimates With Speaking Included
10.6 Analysis of Variance Without Including Speaking
10.7 Parameter Estimates of the Four Predictive Variables, Together With Their Variation Inflation Function
10.8 Partial Output as an Example of Residuals and Predicted Values (a Model Without Speaking)
11.1 Categories and Numbers of Items in the ReTSUQ
11.2 Part of the Result of Unrotated Principal Component Analysis
11.3 The Rotated Pattern Matrix of the ReTSUQ (n = 650)


Preface

The two volumes of Quantitative Data Analysis for Language Assessment (Fundamental Techniques and Advanced Methods), together with the companion website, were motivated by the growing need for a comprehensive sourcebook of quantitative data analysis for the language assessment community. As the focus on developing valid and useful assessments continues to intensify in different parts of the world, having a robust and sound knowledge of quantitative methods has become an increasingly essential requirement. This is particularly important given that one of the community’s responsibilities is to develop language assessments that have evidence of validity, fairness, and reliability. We believe this would be achieved primarily by leveraging quantitative data analysis in test development and validation efforts.

It has been the contributors’ intention to write the chapters with an eye toward what professors, graduate students, and test-development companies would need. The chapters progress gradually from fundamental concepts to advanced topics, making the volumes suitable reference books for professors who teach quantitative methods. If the content of the volumes is too heavy for teaching in one course, we would suggest professors consider using them across two semesters, or alternatively choose any chapters that fit the focus and scope of their courses. For graduate students who have just embarked on their studies or are writing dissertations or theses, the two volumes would serve as a cogent and accessible introduction to the methods that are often used in assessment development and validation research. For organizations in the test-development business, the volumes provide unique topic coverage and examples of applications of the methods in the small- and large-scale language tests that such organizations often deal with.

We would like to thank all of the authors who contributed their expertise in language assessment and quantitative methods. This collaboration has allowed us to emphasize the growing interdisciplinarity in language assessment that draws knowledge and information from many different fields. We wish to acknowledge that in addition to editorial reviews, each chapter has been subjected to rigorous double-blind peer review. We extend a special note of thanks to a number of colleagues who helped us during the review process:

Beth Ann O’Brien, National Institute of Education, Singapore
Christian Spoden, The German Institute for Adult Education, Leibniz Centre for Lifelong Learning, Germany
Tuğba Elif Toprak, Izmir Bakircay University, Turkey
Guangwei Hu, Hong Kong Polytechnic University, Hong Kong
Hamdollah Ravand, Vali-e-Asr University of Rafsanjan, Iran
Ikkyu Choi, Educational Testing Service, USA
Kirby C. Grabowski, Teachers College Columbia University, USA
Mehdi Riazi, Macquarie University, Australia
Moritz Heene, Ludwig-Maximilians-Universität München, Germany
Purya Baghaei, Islamic Azad University of Mashad, Iran
Shane Phillipson, Monash University, Australia
Shangchao Min, Zhejiang University, China
Thomas Eckes, Gesellschaft für Akademische Studienvorbereitung und Testentwicklung e. V. c/o TestDaF-Institut, Ruhr-Universität Bochum, Germany
Trevor Bond, James Cook University, Australia
Wenshu Luo, National Institute of Education, Singapore
Yan Zi, The Education University of Hong Kong, Hong Kong
Yasuyo Sawaki, Waseda University, Japan
Yo In’nami, Chuo University, Japan
Zhang Jie, Shanghai University of Finance and Economics, China

We hope that the readers will find the volumes useful in their research and pedagogy.

Vahid Aryadoust and Michelle Raquel
Editors
April 2019


Editor and contributor biographies

Vahid Aryadoust is Assistant Professor of language assessment literacy at the National Institute of Education of Nanyang Technological University, Singapore. He has led a number of language assessment research projects funded by, for example, the Ministry of Education (Singapore), Michigan Language Assessment (USA), Pearson Education (UK), and Paragon Testing Enterprises (Canada), and published his research in Language Testing, Language Assessment Quarterly, Assessing Writing, Educational Assessment, Educational Psychology, and Computer Assisted Language Learning. He has also (co)authored a number of book chapters and books that have been published by Routledge, Cambridge University Press, Springer, Cambridge Scholar Publishing, Wiley Blackwell, etc.

Trevor Bond is an Adjunct Professor in the College of Arts, Society and Education at James Cook University Australia and the senior author of the book Applying the Rasch Model: Fundamental Measurement in the Human Sciences. He consults with language assessment researchers in Hong Kong and Japan and with high-stakes testing teams in the US, Malaysia, and the UK. In 2005, he instigated the Pacific Rim Objective Measurement Symposia (PROMS), now held annually across East Asia. He is a regular keynote speaker at international measurement conferences, runs Rasch measurement workshops, and serves as a specialist reviewer for academic journals.

Michelle Y. Chen is a research psychometrician at Paragon Testing Enterprises. She received her Ph.D. in measurement, evaluation, and research methodology from the University of British Columbia (UBC). She is interested in research that allows her to collaborate and apply psychometric and statistical techniques. Her research focuses on applied psychometrics, validation, and language testing.

Ikkyu Choi is a Research Scientist in the Center for English Language Learning and Assessment at Educational Testing Service. He received his Ph.D. in applied linguistics from the University of California, Los Angeles in 2013, with a specialization in language assessment. His research interests include second-language development profiles, test-taking processes, scoring of constructed responses, and quantitative research methods for language assessment data.


Thomas Eckes is Head of the Psychometrics and Language Testing Research Department, TestDaF Institute, University of Bochum, Germany. His research focuses on psychometric modeling of language competencies, rater effects in large-scale assessments, and the development and validation of web-based language placement tests. He is on the editorial boards of the journals Language Testing and Assessing Writing. His book Introduction to Many-Facet Rasch Measurement (Peter Lang) appeared in 2015 in a second, expanded edition. He was also guest editor of a special issue on advances in IRT modeling of rater effects (Psychological Test and Assessment Modeling, Parts I & II, 2017, 2018).

Jason Fan is a Research Fellow at the Language Testing Research Centre (LTRC) at the University of Melbourne, and before that, an Associate Professor at the College of Foreign Languages and Literature, Fudan University. His research interests include the validation of language assessments and quantitative research methods. He is the author of Development and Validation of Standards in Language Testing (Shanghai: Fudan University Press, 2018) and the co-author (with Tim McNamara and Ute Knoch) of Fairness and Justice in Language Assessment: The Role of Measurement (Oxford: Oxford University Press, 2019, in press).

Kirby C. Grabowski is Adjunct Assistant Professor of Applied Linguistics and TESOL at Teachers College, Columbia University, where she teaches courses on second-language assessment, performance assessment, generalizability theory, pragmatics assessment, research methods, and linguistics. Dr. Grabowski is currently on the editorial advisory board of Language Assessment Quarterly and formerly served on the Board of the International Language Testing Association as Member-at-Large. Dr. Grabowski was a Spaan Fellow for the English Language Institute at the University of Michigan, and she received the 2011 Jacqueline Ross TOEFL Dissertation Award for outstanding doctoral dissertation in second/foreign language testing from Educational Testing Service.

Rita Green is a Visiting Teaching Fellow at Lancaster University, UK. She is an expert in the field of language testing and has trained test development teams for more than 30 years in numerous projects around the world including those in the fields of education, diplomacy, air traffic control, and the military. She is the author of Statistical Analyses for Language Testers (2013) and Designing Listening Tests: A Practical Approach (2017), both published by Palgrave Macmillan.

Zhi Li is an Assistant Professor in the Department of Linguistics at the University of Saskatchewan (UoS), Canada. Before joining UoS, he worked as a language assessment specialist at Paragon Testing Enterprises, Canada, and a sessional instructor in the Department of Adult Learning at the University of the Fraser Valley, Canada. Zhi Li holds a doctoral degree in applied linguistics and technology from Iowa State University, USA. His research interests include language assessment, technology-supported language teaching and learning, corpus linguistics, and computational linguistics. His research papers have been published in System, CALICO Journal, and Language Learning & Technology.


Rongchan Lin is a Lecturer at National Institute of Education, Nanyang Technological University, Singapore. She has received awards and scholarships such as the 2017 Asian Association for Language Assessment Best Student Paper Award, the 2016 and 2017 Confucius China Studies Program Joint Research Ph.D. Fellowship, the 2014 Tan Ean Kiam Postgraduate Scholarship (Humanities), and the 2012 Tan Kah Kee Postgraduate Scholarship. She was named the 2016 Joan Findlay Dunham Annual Fund Scholar by Teachers College, Columbia University. Her research interests include integrated language assessment and rubric design.

Wenshu Luo is an Assistant Professor at National Institute of Education (NIE), Nanyang Technological University, Singapore. She obtained her Ph.D. in educational psychology from the University of Hong Kong. She teaches quantitative research methods and educational assessment across a number of programs for in-service teachers in NIE. She is an active researcher in student motivation and engagement and has published a number of papers in top journals in this area. She is also keen to find out how cultural and contextual factors, such as school culture, leadership practices, classroom practices, and parenting practices, influence students’ learning.

Michelle Raquel is a Senior Lecturer at the Centre of Applied English Studies, University of Hong Kong, where she teaches language testing and assessment to postgraduate students. She has worked in several tertiary institutions in Hong Kong as an assessment developer and has either led or been part of a group that designed and administered large-scale diagnostic and language proficiency assessments. She has published several articles in international journals on academic English diagnostic assessment, ESL testing of reading and writing, dynamic assessment of second-language dramatic skills, and English for specific purposes (ESP) testing.

Yasuyo Sawaki is a Professor of Applied Linguistics at the School of Education, Waseda University in Tokyo, Japan. Sawaki is interested in a variety of research topics in language assessment ranging from the validation of large-scale international English language assessments to the role of assessment in classroom English language instruction. Her current research topics include examining summary writing performance of university-level Japanese learners of English and English-language demands in undergraduate- and graduate-level content courses at universities in Japan.

Daeryong Seo is a Senior Research Scientist at Pearson. He has led various state assessments and brings international psychometric experience through his work with the Australian NAPLAN and Global Scale of English. He has published several studies in international journals and presented numerous psychometric issues at international conferences, such as the American Educational Research Association (AERA). He also served as a Program Chair of the Rasch special-interest group, AERA. In 2013, he and Dr. Taherbhai received an outstanding paper award from the California Educational Research Association. Their paper is titled “What Makes High School Asian English Learners Tick?”

Husein Taherbhai is a retired Principal Research Scientist who led large-scale assessments in the U.S. for states, such as Arizona, Washington, New York, Maryland, Virginia, Tennessee, etc., and for the National Physical Therapists’ Association’s licensure examination. Internationally, Dr. Taherbhai led the Educational Quality and Assessment Office in Ontario, Canada, and worked for the Central Board of Secondary Education’s Assessment in India. He has published in various scientific journals and has reviewed and presented at the NCME, AERA, and Florida State conferences with papers relating to language learners, rater effects, and students’ equity and growth in education.

Tuğba Elif Toprak is an Assistant Professor of Applied Linguistics/ELT at Izmir Bakircay University, Izmir, Turkey. Her primary research interests are implementing cognitive diagnostic assessment by using contemporary item response theory models and blending cognition with language assessment in her research. Dr. Toprak has been collaborating with international researchers on several research projects that are largely situated in the fields of language assessment, psycholinguistics, and the learning sciences. Her current research interests include intelligent real-time assessment systems, in which she combines techniques from several areas such as the learning sciences, cognitive science, and psychometrics.

Xiaoming Xi is Executive Director of Global Education and Workforce at ETS. Her research spans broad areas of theory and practice, including validity, fairness, test validation methods, approaches to defining test constructs, validity frameworks for automated scoring, automated scoring of speech, the role of technology in language assessment and learning, and test design, rater, and scoring issues. She is co-editor of the Routledge book series Innovations in Language Learning and Assessment and is on the Editorial Boards of Language Testing and Language Assessment Quarterly. She received her Ph.D. in language assessment from UCLA.

Limei Zhang is a Lecturer at the Singapore Centre for Chinese Language, Nanyang Technological University. She obtained her Ph.D. in applied linguistics with an emphasis on language assessment from the National Institute of Education, Nanyang Technological University. Her research interests include language assessment literacy, reading and writing assessment, and learners’ metacognition. She has published papers in journals including The Asia-Pacific Education Researcher, Language Assessment Quarterly, and Language Testing. Her most recent book is Metacognitive and Cognitive Strategy Use and EFL Reading Test Performance: A Study of Chinese College Students (Springer).

Introduction

Vahid Aryadoust and Michelle Raquel

Quantitative techniques are mainstream components in most of the published literature in language assessment, as they are essential in test development and validation research (Chapelle, Enright, & Jamieson, 2008). Three families of quantitative methods are adopted in language assessment research: measurement models, statistical methods, and data mining (although, admittedly, drawing a definite boundary between these families would not be feasible).

Borsboom (2005) proposes that measurement models, the first family of quantitative methods in language assessment, fall within the paradigms of classical test theory (CTT), Rasch measurement, or item response theory (IRT). The common feature of the three measurement approaches is that they are intended to predict outcomes of cognitive, educational, and psychological testing. However, they differ significantly in their underlying assumptions and applications. CTT is founded on true scores, which can be estimated by using the error of measurement and observed scores. Internal consistency reliability and generalizability theory are also formulated on CTT premises. Rasch measurement and IRT, on the other hand, are probabilistic models that are used for the measurement of latent variables, that is, attributes that are not directly observed. There are a number of unidimensional Rasch and IRT models, which assume that the attribute underlying test performance comprises only one measurable feature. There are also multidimensional models, which postulate that a test measures several distinct latent variables. Determining whether a test is unidimensional or multidimensional requires theoretical grounding, the application of sophisticated quantitative methods, and an evaluation of the test context. For example, multidimensional tests can be used to provide fine-grained diagnostic information to stakeholders, and thus a multidimensional IRT model can be used to derive useful diagnostic information from test scores. In the current two volumes, CTT and unidimensional Rasch models are discussed in Volume I, and multidimensional techniques are covered in Volume II.
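
To make the contrast concrete, here is a minimal sketch (not drawn from the chapters themselves) of the CTT score decomposition and its reliability coefficient, alongside the dichotomous Rasch model; the notation is generic textbook notation.

```latex
% CTT: an observed score X is a true score T plus random measurement error E;
% reliability is the share of observed-score variance attributable to true scores.
X = T + E, \qquad \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}

% Dichotomous Rasch model: the probability of a correct response depends only on the
% difference between person ability \theta_n and item difficulty b_i.
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```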

The second group of methods is statistical and consists of the commonly used methods in language assessment such as t-tests, analysis of variance (ANOVA), analysis of covariance (ANCOVA), multivariate analysis of covariance (MANCOVA), regression models, and factor analysis, which are covered in Volume I. In addition, multilevel modeling and structural equation modeling (SEM) are presented in Volume II. The research questions that these techniques aim to address range from comparing average performances of test takers to prediction and data reduction. The third group of models falls under the umbrella of data mining techniques, which we believe are relatively underresearched and underutilized in language assessment. Volume II presents two data mining methods: classification and regression trees (CART) and evolutionary algorithm-based symbolic regression, both of which are used for prediction and classification. These methods detect the relationship between dependent and independent variables in the form of mathematical functions which confirm postulated relationships between variables across separate datasets. This feature of the two data mining techniques, discussed in Volume II, improves the precision and generalizability of the detected relationships.

We provide an overview of the two volumes in the next sections.

Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques

This volume comprises 11 chapters contributed by a number of experts in the field of language assessment and quantitative data analysis techniques. The aim of the volume is to revisit the fundamental quantitative topics that have been used in the language assessment literature and shed light on their rationales and assumptions. This is achieved by delineating the technique covered in each chapter, providing a (brief) review of its application in previous language assessment research, and giving a theory-driven example of the application of the technique. The chapters in Volume I are grouped into three main sections, which are discussed below.

Section I. Test development, reliability, and generalizability

Chapter 1: Item analysis in language assessment (Rita Green)

This chapter deals with a fundamental but, as Rita Green notes, often-delayed step in language test development. Item analysis is a quantitative method that allows test developers to examine the quality of test items, i.e., which items are working well (assessing the construct they are meant to assess) and which items should be revised or dropped to improve overall test reliability. Unfortunately, as the author notes, this step is commonly carried out after a test has been administered rather than when items have just been developed. The chapter starts with an explanation of the importance of this method at the test-development stage. Then several language testing studies are reviewed that have utilized this method to investigate test validity and reliability, to improve standard-setting sessions, and to investigate the impact of test format and different testing conditions on test taker performance. The author further emphasizes the need for language testing professionals to learn this method and its link to language assessment research by suggesting five research questions in item analysis. The use of this method is demonstrated by an analysis of a multiple-choice grammar and vocabulary test. The author concludes the chapter by demonstrating how the analysis can answer the five research questions proposed, and by offering suggestions on how to improve the test.
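
To give a flavor of the statistics item analysis typically produces, the sketch below computes facility values, corrected point-biserial discrimination indices, and Cronbach's alpha for a small, entirely hypothetical set of scored responses (it is not the chapter's dataset).

```python
import numpy as np

# Hypothetical scored responses: rows = test takers, columns = items (1 = correct, 0 = incorrect)
X = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
])

facility = X.mean(axis=0)   # proportion correct per item (facility value)

total = X.sum(axis=1)
# Corrected point-biserial discrimination: correlate each item with the total score
# computed from the remaining items, so an item is not correlated with itself.
discrimination = np.array([
    np.corrcoef(X[:, i], total - X[:, i])[0, 1] for i in range(X.shape[1])
])

# Cronbach's alpha as an internal-consistency (reliability) estimate
k = X.shape[1]
alpha = k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum() / total.var(ddof=1))

print(facility, discrimination, alpha)
```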

Chapter 2: Univariate generalizability theory in language assessment (Yasuyo Sawaki and Xiaoming Xi)

In addition to item analysis, investigating reliability and generalizability is a fundamental consideration of test development. Chapter 2 presents and extends the framework for investigating reliability within the paradigm of classical test theory (CTT). Generalizability theory (G theory) is a powerful method of investigating the extent to which scores are reliable, as it is able to account for different sources of variability and their interactions in one analysis. The chapter provides an overview of the key concepts in this method, outlines the steps in the analyses, and presents an important caveat in the application of this method, i.e., conceptualization of an appropriate rating design that fits the context. A sample study demonstrating the use of this method is presented to investigate the dependability of ratings given on an English as a foreign language (EFL) summary writing task. The authors compare the results of two G theory analyses, the rating method and the block method, to demonstrate to readers the impact of rating design on the results of the analysis. The chapter concludes with a discussion of the strengths of the analysis compared to other CTT-based reliability indices, the value of this method in investigating rater behavior, and suggested references should readers wish to extend their knowledge of this technique.
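
The sketch below illustrates, under simplifying assumptions, what a univariate G theory analysis estimates for a single-facet persons-by-items (p × i) crossed design: variance components derived from mean squares, and the resulting generalizability and dependability coefficients. The data and design are hypothetical, not the chapter's rating-design example.

```python
import numpy as np

# Hypothetical persons-by-items (p x i) score matrix with one observation per cell
scores = np.array([
    [3, 4, 2, 5],
    [2, 3, 2, 4],
    [4, 5, 3, 5],
    [1, 2, 1, 3],
    [3, 3, 2, 4],
], dtype=float)
n_p, n_i = scores.shape

grand = scores.mean()
p_means = scores.mean(axis=1)
i_means = scores.mean(axis=0)

# Sums of squares and mean squares for the two-way layout
ss_p = n_i * ((p_means - grand) ** 2).sum()
ss_i = n_p * ((i_means - grand) ** 2).sum()
ss_pi = ((scores - p_means[:, None] - i_means[None, :] + grand) ** 2).sum()
ms_p, ms_i, ms_pi = ss_p / (n_p - 1), ss_i / (n_i - 1), ss_pi / ((n_p - 1) * (n_i - 1))

# G study: solve the expected-mean-square equations for the variance components
var_pi = ms_pi                              # person-by-item interaction (confounded with error)
var_p = max((ms_p - ms_pi) / n_i, 0.0)      # universe-score (person) variance
var_i = max((ms_i - ms_pi) / n_p, 0.0)      # item variance

# D study: relative (generalizability) and absolute (dependability) coefficients for n_i' items
n_i_prime = n_i
g_coefficient = var_p / (var_p + var_pi / n_i_prime)
phi_coefficient = var_p / (var_p + (var_i + var_pi) / n_i_prime)
print(var_p, var_i, var_pi, g_coefficient, phi_coefficient)
```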

Chapter 3: Multivariate generalizability theory in language assessment (Kirby C. Grabowski and Rongchan Lin)

In performance assessments, multiple factors contribute to a test taker’s overall score, such as task type, the rating scale structure, and the rater, meaning that scores are influenced by multiple sources of variance. Although univariate G theory analysis is able to determine the reliability of scores, it is limited in that it does not consider the impact of these sources of variance simultaneously. Multivariate G theory analysis is a powerful statistical technique because, in addition to the results generated by univariate G theory analysis, it is able to generate a reliability index that accounts for all these factors in one analysis. The analysis is also able to consider the impact of the subscales of a rating scale. The authors begin the chapter with an overview of the basic concepts of multivariate G theory. Next, they illustrate an application of this method through an analysis of a listening-speaking test, making clear links between research questions and the results of the analysis. The chapter concludes with caveats on the use of this method and suggested references for readers who wish to complement their multivariate G theory analyses with other methods.


Section II. Unidimensional Rasch measurement

Chapter 4: Applying Rasch measurement in language assessment: unidimensionality and local independence (Jason Fan and Trevor Bond)

This chapter discusses the two fundamental concepts required in the application of Rasch measurement in language assessment research, i.e., unidimensionality and local independence. It provides an accessible discussion of these concepts in the context of language assessment. The authors first explain how the two concepts should be perceived from a measurement perspective. This is followed by a brief explanation of the Rasch model, a description of how these two measurement properties are investigated through Rasch residuals, and a review of Rasch-based studies in language assessment that report the existence of these properties to strengthen test validity claims. The authors demonstrate the investigation of these properties through the analysis of items in a listening test using the Partial Credit Rasch model. The results of the study revealed that the listening test is unidimensional and that the principal component analysis of residuals provides evidence of local independence of items. The chapter concludes with a discussion of practical considerations and suggestions on steps to take should test developers encounter situations in which these properties of measurement are violated.
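
A minimal sketch of the residual-based checks described here: it simulates Rasch-fitting data, computes standardized residuals, and inspects the eigenvalues and correlations of those residuals. A real analysis would use calibrated person and item estimates from Rasch software rather than the made-up values below.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical calibration: person abilities (theta) and item difficulties (b) in logits.
# Responses are simulated from the Rasch model purely for illustration.
theta = rng.normal(0.0, 1.0, size=300)
b = np.linspace(-1.5, 1.5, 12)
P = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))   # model-expected probabilities
X = (rng.random(P.shape) < P).astype(int)              # simulated dichotomous responses

Z = (X - P) / np.sqrt(P * (1 - P))                     # standardized residuals

# Principal component analysis of residuals (PCAR): a large first eigenvalue (a common
# rule of thumb in Rasch practice is > 2) hints at a secondary dimension, and sizeable
# residual correlations between item pairs suggest local dependence.
R = np.corrcoef(Z, rowvar=False)
print(np.sort(np.linalg.eigvalsh(R))[::-1][:3])        # largest residual eigenvalues
print(R[np.triu_indices_from(R, k=1)].max())           # largest off-diagonal residual correlation
```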

Chapter 5: The Rasch measurement approach to differential item functioning (DIF) analysis in language assessment research (Michelle Raquel)

This chapter continues the discussion of test measurement properties. Differential item functioning (DIF) is the statistical term used to describe items that inadvertently have different item estimates for different subgroups because they are affected by characteristics of the test takers such as gender, age group, or ethnicity. The author first explains the concept of DIF and then provides a brief overview of different DIF detection methods used in language assessment research. A review of DIF studies in language testing follows, which includes a summary of current DIF studies, the DIF method(s) used, and whether the studies investigated the causes of DIF. The chapter then illustrates one of the most commonly used DIF detection methods, Rasch-based DIF analysis. The sample study investigates the presence of DIF in a diagnostic English listening test in which students were classified according to the English language curriculum they had taken (Hong Kong vs. Macau). The results of the study revealed that although a considerable number of items were flagged for DIF, overall test results did not seem to be affected.
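
As a minimal sketch of the Rasch-based DIF logic (the estimates below are invented for illustration), an item's difficulty is calibrated separately for the two groups and the contrast is tested against its joint standard error, which is broadly how Rasch software reports DIF.

```python
import numpy as np

# Hypothetical Rasch difficulty estimates (in logits) and standard errors for one item,
# calibrated separately for two curriculum groups (e.g., Hong Kong vs. Macau).
b_hk, se_hk = 0.42, 0.08
b_macau, se_macau = 0.81, 0.10

# DIF contrast and an approximate t statistic against the joint standard error.
# A contrast of roughly 0.5 logits or more with |t| > 2 is commonly flagged as
# both substantively and statistically noteworthy DIF.
dif_contrast = b_macau - b_hk
t = dif_contrast / np.sqrt(se_hk**2 + se_macau**2)
print(dif_contrast, t)
```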

Chapter 6: Application of the rating scale model and the partial credit model in language assessment research (Ikkyu Choi)

This chapter introduces two Rasch models that are used to analyze polytomous data, usually generated by performance assessments (speaking or writing tests) and questionnaires used in language assessment studies. First, Ikkyu Choi explains the relationship between the Rating Scale Model (RSM) and the Partial Credit Model (PCM) through a gentle review of their algebraic representations. This is followed by a discussion of the differences between these models and a review of studies that have utilized them. The author notes in his review that researchers rarely provide a rationale for the choice of model, and neither do they compare models. In the sample study, which investigates the scale of a motivation questionnaire, the author provides a thorough and graphic comparison and evaluation of the RSM and the PCM and their impact on the scale structure of the questionnaire. The chapter concludes with a justification of why the PCM was more appropriate for the context, the limitations of the parameter estimation method used in the sample study, and a list of suggested topics for readers who wish to extend their knowledge.
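
A minimal sketch of the category probabilities that both models define for a single item (threshold values are hypothetical): under the PCM the thresholds are item-specific, whereas under the RSM one set of thresholds, offset by the item's location, is shared by all items.

```python
import numpy as np

def pcm_probabilities(theta, deltas):
    """Partial credit model: probabilities of scoring 0..m on one item.

    deltas holds the item's step (threshold) parameters. Under the rating scale model
    the same threshold set (offset by an item location) would be shared by every item,
    whereas under the PCM each item has its own thresholds.
    """
    # Cumulative sums of (theta - delta_j); the score-0 category has an empty sum of 0.
    steps = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas, dtype=float))))
    exps = np.exp(steps - steps.max())   # subtract the max for numerical stability
    return exps / exps.sum()

# Hypothetical thresholds for a four-category (0-3) item, evaluated at theta = 0.5 logits
print(pcm_probabilities(theta=0.5, deltas=[-1.0, 0.2, 1.3]))
```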

Chapter 7: Many-facet Rasch measurement: implications for rater-mediated language assessment (Thomas Eckes)

This chapter discusses one of the most popular item response theory (IRT)-based methods for analyzing rater-mediated assessments. A common problem in speaking and writing tests is that marks or grades depend on human raters, who, despite training, most likely have their own conceptions of how to mark, which affects test reliability. Many-facet Rasch measurement (MFRM) provides a solution to this problem in that the analysis simultaneously includes multiple facets such as raters, assessment criteria, test format, or the time when a test is taken. The author first provides an overview of rater-mediated assessments and MFRM concepts. The application of this method is illustrated through an analysis of a writing assessment in which the author demonstrates how to determine rater severity and consistency of ratings, and how to generate test scores after adjusting for differences in ratings. The chapter concludes with a discussion of advances in MFRM research and controversial issues related to this method.
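
As a sketch of the kind of model the chapter works with, a three-facet rating scale formulation of MFRM can be written in log-odds form as follows (the notation is generic and not necessarily the chapter's own):

```latex
% Three-facet rating scale formulation (log-odds form): the log odds of test taker n
% being rated in category k rather than k-1 by rater j on criterion i.
\log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \beta_i - \alpha_j - \tau_k
% \theta_n: test taker proficiency   \beta_i: criterion difficulty
% \alpha_j: rater severity           \tau_k: threshold of rating scale category k
```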

Section III. Univariate and multivariate statistical analysis

Chapter 8: Analysis of differences between groups: the t-test and the analysis of variance (ANOVA) in language assessment (Tuğba Elif Toprak)

The third section of this volume starts with a discussion of two of the most fundamental and commonly used statistical techniques for comparing test score results and determining whether differences between groups are due to chance. For example, language testers often find themselves comparing two or more groups of test takers or comparing pre-test and post-test scores. The chapter starts with an overview of t-tests and the analysis of variance (ANOVA) and the assumptions that must be met before embarking on these analyses. The literature review provides summary tables of recent studies that have employed each method. The application of the t-test is demonstrated through a sample study that investigated the impact of English songs on students’ pronunciation development, in which the author divided the students into two groups (experimental vs. control) and then compared the groups’ results on a pronunciation test. The second study utilized ANOVA to determine whether students’ academic reading proficiency differed across college years (freshmen, sophomores, juniors, seniors) and which group was significantly different from the others.
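
The two designs map onto standard routines; a minimal sketch with simulated (hypothetical) scores using SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical pronunciation scores for an experimental and a control group
experimental = rng.normal(75, 8, size=30)
control = rng.normal(70, 8, size=30)
t, p = stats.ttest_ind(experimental, control)   # independent-samples t-test

# Hypothetical academic reading scores for four college years, compared with one-way ANOVA
freshmen, sophomores, juniors, seniors = (rng.normal(m, 10, size=25) for m in (60, 64, 68, 71))
f, p_anova = stats.f_oneway(freshmen, sophomores, juniors, seniors)

print(t, p, f, p_anova)
```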

Chapter 9: Application of ANCOVA and MANCOVA in language assessment research (Zhi Li and Michelle Y. Chen)

This chapter extends the discussion of methods used to compare test results. Instead of using one variable to classify the groups that are compared, analysis of covariance (ANCOVA) and multivariate analysis of covariance (MANCOVA) consider multiple variables across multiple groups to determine whether differences in group scores are statistically significant. ANCOVA is used when there is only one dependent variable, while MANCOVA is used when two or more dependent variables are included in the comparison. Both techniques control for the effect of one or more variables that co-vary with the dependent variables. The chapter begins with a brief discussion of these two methods, the situations in which they should be used, the assumptions that must be fulfilled before analysis can begin, and a brief discussion of how results should be reported. The authors present the results of their meta-analyses of studies that have utilized these methods and outline the issues related to results reporting in these studies. The application of these methods is demonstrated in analyses of the Programme for International Student Assessment (PISA) 2009 reading test results of Canadian children.
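
A minimal ANCOVA sketch with simulated data (the variable names and values are hypothetical, not from the PISA example), assuming the statsmodels formula interface:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 120

# Hypothetical data: overall reading scores for two groups, with attitude as the covariate
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], n // 2),
    "attitude": rng.normal(0, 1, n),
})
df["reading"] = 500 + 15 * df["attitude"] + np.where(df["group"] == "A", 8, 0) + rng.normal(0, 20, n)

# ANCOVA: test the group effect on reading after adjusting for the attitude covariate
model = smf.ols("reading ~ C(group) + attitude", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```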

Chapter 10: Application of linear regression in language assessment (Daeryong Seo and Husein Taherbhai)

There are cases in which language testers need to determine the impact of one variable on another, such as whether someone’s first language has an impact on the learning of a second language. Linear regression is the appropriate statistical technique when one aims to determine the extent to which one or more independent variables linearly impact a dependent variable. This chapter opens with a brief discussion of the differences between simple and multiple linear regression and a full discussion of the assumptions that must be fulfilled before commencing analysis. Next, the authors present a brief literature review of factors that affect English language proficiency, as these determine which variables should be included in the statistical model. The sample study illustrates the application of linear regression by predicting students’ results on an English language arts examination from their performance in English proficiency tests of reading, listening, speaking, and writing. The chapter concludes with a checklist of concepts to consider before doing regression analysis.
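
A minimal multiple linear regression sketch with simulated (hypothetical) sub-scores, assuming statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200

# Hypothetical English proficiency sub-scores used to predict an English language arts score
df = pd.DataFrame({
    "reading": rng.normal(60, 10, n),
    "listening": rng.normal(55, 10, n),
    "speaking": rng.normal(58, 10, n),
    "writing": rng.normal(50, 10, n),
})
df["ela"] = (10 + 0.5 * df["reading"] + 0.3 * df["listening"]
             + 0.2 * df["speaking"] + 0.4 * df["writing"] + rng.normal(0, 8, n))

# Multiple linear regression: each coefficient is the expected change in the ELA score
# for a one-point increase in that predictor, holding the other predictors constant.
fit = smf.ols("ela ~ reading + listening + speaking + writing", data=df).fit()
print(fit.summary())
```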


Chapter 11: Application of exploratory factor analysis in language assessment (Limei Zhang and Wenshu Luo)

A standard procedure in test and survey development is to check whether a test or questionnaire measures one underlying construct or dimension. Ideally, test and questionnaire items are constructed to measure a latent construct (e.g., 20 items to measure listening comprehension), but each item is designed to measure a different aspect of the construct (e.g., items that measure the ability to listen for details, the ability to listen for main ideas, etc.). Exploratory factor analysis (EFA) is a statistical technique that examines how items group together into themes (factors) that ultimately measure the latent trait. The chapter commences with an overview of EFA, the different methods for extracting the factors from the data, and an outline of the steps in conducting an EFA. This is followed by a literature review that highlights the different ways the method has been applied in language testing research, with a specific focus on studies that confirm the factor structure of tests and questionnaires. The sample study demonstrates how EFA can do this by analyzing the factor structure of the Reading Test Strategy Use Questionnaire, used to determine the types of reading strategies that Chinese students use as they complete reading comprehension tests.
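
A minimal EFA-style sketch on simulated (hypothetical) questionnaire data: the eigenvalues of the item correlation matrix inform the number of factors to retain, and a factor extraction yields the loading matrix that shows how items group together. scikit-learn's FactorAnalysis is used here as a stand-in; dedicated EFA packages offer more extraction and rotation options.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)

# Hypothetical 5-point Likert responses to a 12-item strategy-use questionnaire
data = rng.integers(1, 6, size=(650, 12)).astype(float)

# Eigenvalues of the item correlation matrix: a scree plot or the Kaiser criterion
# (eigenvalues > 1) is a common first step for deciding how many factors to retain.
eigenvalues = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
print(eigenvalues)

# Extract three factors; the loading matrix shows which items load on each factor.
# (The rotation argument requires a reasonably recent scikit-learn release.)
efa = FactorAnalysis(n_components=3, rotation="varimax").fit(data)
print(np.round(efa.components_.T, 2))   # items x factors loading matrix
```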

Quantitative Data Analysis for Language Assessment Volume II: Advanced Methods

Volume II comprises three major categories of quantitative methods in language testing research: advanced IRT, advanced statistical methods, and nature-inspired data mining methods. We provide an overview of the sections and chapters below.

Section I. Advanced Item Response Theory (IRT) models in language assessment

Chapter 1: Mixed Rasch modeling in assessing reading comprehension (Purya Baghaei, Christoph Kemper, Samuel Greif, and Monique Reichert)

In this chapter, the authors discuss the application of the mixed Rasch model (MRM) in assessing reading comprehension. MRM is an advanced psychometric approach for detecting latent class differential item functioning (DIF) which conflates the Rasch model and latent class analysis. MRM relaxes some of the requirements of conventional Rasch measurement while preserving most of the fundamental features of the method. MRM further combines the Rasch model with latent class modeling, which classifies test takers into exclusive classes with qualitatively different features. Baghaei et al. apply the model to a high-stakes reading comprehension test in English as a foreign language and detect two latent classes of test takers for whom the difficulty level of the test items differs. They discuss the differentiating feature of the classes and conclude that MRM can be applied to identify sources of multidimensionality.


Chapter 2: Multidimensional Rasch models in first language listening tests (Christian Spoden and Jens Fleischer)

Since the introduction of Rasch measurement to language assessment, a group of scholars has contended that language is not a unidimensional phenomenon, and, accordingly, unidimensional modeling of language assessment data (e.g., through the unidimensional Rasch model) would conceal the role of many linguistic features that are integral to language performance. The multidimensional Rasch model could be viewed as a response to these concerns. In this chapter, the authors provide a didactic presentation of the multidimensional Rasch model and apply it to a listening assessment. They discuss the advantages of adopting the model in language assessment research, specifically the improvement in the estimation of reliability as a result of the incorporation of dimension correlations, and explain how model comparison can be carried out while elaborating on multidimensionality in listening comprehension assessments. They conclude the chapter with a brief summary of other multidimensional Rasch models and their value in language assessment research.

Chapter 3: The Log-Linear Cognitive Diagnosis Modeling (LCDM) in second language listening assessment (Elif Toprak, Vahid Aryadoust, and Christine Goh)

Another group of multidimensional models, called cognitive diagnostic models (CDMs), combines psychometrics and psychology. One of the differences between CDMs and the multidimensional Rasch models is that the former family estimates test takers’ mastery of sub-skills, whereas the latter group provides a general estimation of ability for each sub-skill. In this chapter, the authors introduce the Log-Linear Cognitive Diagnosis Modeling (LCDM), which is a flexible CDM technique for modeling assessment data. They apply the model to a high-stakes norm-referenced listening test (a practice that is known as retrofitting) to determine whether they can derive diagnostic information concerning test takers’ weaknesses and strengths. Toprak et al. argue that although norm-referenced assessments do not usually provide such diagnostic information about the language abilities of test takers, providing such information is practical, as it helps language learners who wish to know this information to improve their language skills. They provide guidelines on the estimation and fitting of the LCDM, which are also applicable to other CDM techniques.

Chapter 4: Hierarchical diagnostic classification models in assessing reading comprehension (Hamdollah Ravand)

In this chapter, the author presents another group of CDM techniques, including the deterministic inputs, noisy “and” gate (DINA) model and the generalized deterministic inputs, noisy “and” gate (G-DINA) model, which are attracting increasing attention in language assessment research. Ravand begins the chapter by providing step-by-step guidelines for model selection, development, and evaluation, elaborating on fit statistics and other relevant concepts in CDM analysis. Like Toprak et al., who present the LCDM in Chapter 3, Ravand argues for retrofitting CDMs to norm-referenced language assessments and provides an illustrative example of the application of CDMs to a non-diagnostic high-stakes test of reading. He further explains how to use and interpret fit statistics (i.e., relative and absolute fit indices) to select the optimal model among the available CDMs.
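
As a minimal sketch of the response model underlying the DINA approach (the Q-matrix and parameter values below are invented for illustration, not taken from Ravand's example):

```python
import numpy as np

def dina_correct_probability(alpha, q_matrix, slip, guess):
    """DINA model sketch: probability of a correct response to each item, given a binary
    skill-mastery profile alpha, an item-by-skill Q-matrix, and item slip/guess parameters."""
    alpha = np.asarray(alpha)
    q_matrix = np.asarray(q_matrix)
    # eta_j = 1 only if the test taker has mastered every skill that item j requires
    eta = np.all(alpha[None, :] >= q_matrix, axis=1).astype(int)
    return (1 - np.asarray(slip)) ** eta * np.asarray(guess) ** (1 - eta)

# Hypothetical three-item, two-skill example: the test taker masters skill 1 but not skill 2
q = [[1, 0], [0, 1], [1, 1]]
print(dina_correct_probability(alpha=[1, 0], q_matrix=q,
                               slip=[0.1, 0.1, 0.2], guess=[0.2, 0.25, 0.15]))
```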

Section II. Advanced statistical methods in language assessment

Chapter 5: Structural equation modeling in language assessment (Xuelian Zhu, Michelle Raquel, and Vahid Aryadoust)

This chapter discusses one of the most commonly used techniques in the field, whose application in assessment research goes back to, at least, the 1990s. Instead of modeling a linear relationship of variables, structural equation modeling (SEM) is used to concurrently model direct and indirect relationships between variables. The authors first provide a review of SEM in language assessment research and propose a framework for model development, specification, and validation. They discuss the requirements of sample size, fit, and model respecification and apply SEM to confirm the use of a diagnostic test in predicting the proficiency level of test takers as well as the possible mediating role of some demographic factors in the model tested. While SEM can be applied to both dichotomous and polytomous data, the authors focus on the latter group of data, while stressing that the principles and guidelines spelled out are directly applicable to dichotomous data. They further mention other applications of SEM such as multigroup modeling and SEM of dichotomous data.

Chapter 6: Growth modeling using growth percentiles for longitudinal studies (Daeryong Seo and Husein Taherbhai)

This chapter presents a method for modeling growth in longitudinal data, the student growth percentile (SGP), which is estimated using quantile regression. A distinctive feature of SGP is that it compares test takers with peers who have the same history of test performance and achievement. This means that even when two test takers with different assessment histories obtain the same current test score, their SGP scores on the current test can differ. Another feature that differentiates SGP from similar techniques, such as multilevel modeling (MLM) and latent growth curve models, is that it does not require test equating, which can itself be a time-consuming process. Researchers and language teachers often wish to determine whether a particular test taker has a chance of reaching a pre-determined cut score, yet a quick glance at the literature shows that the quantitative tools commonly available do not provide such information. Seo and Taherbhai show that, through quantile regression, one can estimate a test taker's propensity to achieve the SGP score required to reach the cut score. The technique lends itself to the investigation of change in the four language modalities, i.e., reading, writing, listening, and speaking.
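
The sketch below illustrates the core idea with the QuantReg estimator in the Python statsmodels package; the simulated scores, variable names, and the simple percentile-lookup rule are assumptions made for illustration and do not reproduce the authors' procedure.

```python
# A minimal sketch of the student growth percentile (SGP) idea using quantile
# regression in statsmodels; the simulated data, variable names, and the simple
# percentile-lookup rule below are illustrative assumptions only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
prior = rng.normal(500, 50, 300)                  # prior-year test scores
current = 0.8 * prior + rng.normal(0, 30, 300)    # current-year test scores
X = sm.add_constant(prior)

# Fit one conditional-quantile model per percentile of interest.
taus = np.arange(0.05, 1.0, 0.05)
fits = {tau: sm.QuantReg(current, X).fit(q=tau) for tau in taus}

def sgp(prior_score, current_score):
    """Highest percentile whose predicted conditional quantile, given the
    prior score, does not exceed the observed current score."""
    x = np.array([1.0, prior_score])
    below = [tau for tau, res in fits.items() if res.params @ x <= current_score]
    return int(round(100 * max(below))) if below else 5   # floor at the lowest tau

# Same current score, different score histories -> different growth percentiles.
print(sgp(prior_score=480, current_score=410))
print(sgp(prior_score=550, current_score=410))
```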

Chapter 7: Multilevel modeling to examine sources of variability in second language test scores (Yo In’nami and Khaled Barkaoui)

Multilevel modeling (MLM) is based on the premise that test takers' performance is a function of their measured abilities as well as of additional levels of variation, such as the classrooms, schools, or cities the test takers come from. According to the authors, MLM is particularly useful when test takers belong to pre-specified homogeneous subgroups, such as classrooms, whose characteristics differ from those of other subgroups. The combination of between-subgroup heterogeneity and within-subgroup homogeneity yields a source of variance in the data which, if ignored, can inflate the chance of a Type I error (i.e., rejection of a true null hypothesis). The authors provide guidelines and advice on using MLM and showcase the application of the technique to a second-language vocabulary test.
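
The following sketch shows the kind of two-level, random-intercept model involved, fitted with the Python statsmodels package; the data frame, column names, and values are invented for illustration and do not come from the chapter.

```python
# A random-intercept multilevel model: vocabulary scores nested in classrooms.
# The data frame, column names, and values are illustrative assumptions only.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "score":     [42, 55, 61, 38, 47, 52, 66, 59, 44, 50, 63, 48],
    "hours":     [ 3,  6,  8,  2,  5,  4,  9,  7,  3,  5,  8,  4],  # weekly study hours
    "classroom": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
})

# The classroom random intercept absorbs between-classroom variance; ignoring
# it would treat all twelve scores as independent and understate standard errors.
model = smf.mixedlm("score ~ hours", data=df, groups=df["classroom"])
result = model.fit()
print(result.summary())
```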

Chapter 8: Longitudinal multilevel modeling to examine changes in second language test scores (Khaled Barkaoui and Yo In’nami)

In this chapter, the authors propose that the flexibility of MLM renders it well suited for modeling growth and investigating the sensitivity of test scores to change over time. They argue that MLM is a hierarchical alternative to linear methods such as analysis of variance (ANOVA) and linear regression, and they present an example based on longitudinal second-language data. They encourage MLM users to consider and control for variability across test forms, which can confound assessments over time, in order to ensure test equity before using the scores in MLM analysis and to maximize the validity of the uses and interpretations of the test scores.
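
A generic two-level growth specification of the kind implied here (a sketch, not the authors' exact model) places repeated measurements at level 1 and test takers at level 2:

\[
\text{Level 1:}\quad y_{ti} = \pi_{0i} + \pi_{1i}\,\mathrm{Time}_{ti} + e_{ti}
\]
\[
\text{Level 2:}\quad \pi_{0i} = \beta_{00} + r_{0i}, \qquad \pi_{1i} = \beta_{10} + \beta_{11} W_i + r_{1i},
\]

where \(y_{ti}\) is test taker i's score at occasion t, \(\pi_{0i}\) and \(\pi_{1i}\) are the person-specific initial status and growth rate, and \(W_i\) is a person-level covariate (for example, type of instruction).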

Section III. Nature-inspired data mining methods in language assessment

Chapter 9: Classification and regression trees in predicting listening item difficulty (Vahid Aryadoust and Christine Goh)

The first data mining method in this section is classification and regression trees (CART), presented by Aryadoust and Goh. CART is used for the same kinds of prediction and classification problems as linear regression and other classification techniques; however, it relaxes the normality and other assumptions required by parametric models such as regression analysis. Aryadoust and Goh review the literature on the application of CART in language assessment and propose a multi-stage framework for CART modeling that starts with establishing a theoretical framework and ends with cross-validation. The authors apply CART to 321 listening test items and generate a set of IF-THEN rules that link item difficulty to the linguistic features of the items in a non-linear way. The chapter also stresses the role of cross-validation in CART modeling and describes the features of two cross-validation methods (n-fold cross-validation and train-test cross-validation).
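
The sketch below shows the general shape of such an analysis with the Python scikit-learn library; the three linguistic features, the simulated difficulty values, and the tree settings are placeholders rather than the variables or results reported in the chapter.

```python
# A minimal sketch of a regression tree predicting listening-item difficulty
# from linguistic item features, with n-fold cross-validation (scikit-learn).
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_items = 321
X = np.column_stack([
    rng.integers(5, 40, n_items),      # number of words in the item stem
    rng.uniform(0, 1, n_items),        # proportion of low-frequency vocabulary
    rng.integers(1, 6, n_items),       # number of idea units in the input text
])
difficulty = 0.04 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 0.3, n_items)

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20, random_state=1)

# n-fold (here 10-fold) cross-validation guards against overfitting the rules.
cv_r2 = cross_val_score(tree, X, difficulty, cv=10, scoring="r2")
print("Mean cross-validated R^2:", cv_r2.mean().round(2))

# The fitted tree can be read off as IF-THEN rules linking features to difficulty.
tree.fit(X, difficulty)
print(export_text(tree, feature_names=["stem_length", "low_freq_prop", "idea_units"]))
```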

Chapter 10: Evolutionary algorithm-based symbolic regression to determine the relationship of reading and lexico-grammatical knowledge (Vahid Aryadoust)

Aryadoust introduces the evolutionary algorithm-based (EA-based) symbolic regression method and showcases its application in reading assessment. Like CART, EA-based symbolic regression is a non-linear data analysis method that comprises a training stage and a cross-validation stage. The technique is inspired by the principles of Darwinian evolution; accordingly, concepts such as survival of the fittest, offspring, breeding, chromosomes, and cross-over are incorporated into the mathematical modeling procedures. The non-parametric nature and cross-validation capabilities of EA-based symbolic regression render it a powerful classification and prediction method for language assessment. Aryadoust presents a prediction study in which lexico-grammatical abilities serve as independent variables predicting the reading ability of English learners. He compares the predictive power of the method with that of a linear regression model and shows that the technique yields more precise solutions.
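
To give a flavor of the approach, the sketch below uses the third-party gplearn package, a Python implementation of genetic-programming symbolic regression; the simulated sub-scores, the train/hold-out split, and the parameter values are illustrative assumptions rather than the chapter's design.

```python
# A minimal sketch of evolutionary (genetic-programming) symbolic regression
# with the third-party gplearn package; the predictors below are simulated
# stand-ins for lexico-grammatical sub-scores, not the chapter's dataset.
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(2)
vocab = rng.normal(0, 1, 400)        # vocabulary knowledge score
grammar = rng.normal(0, 1, 400)      # grammar knowledge score
X = np.column_stack([vocab, grammar])
reading = 0.6 * vocab + 0.3 * grammar * vocab + rng.normal(0, 0.2, 400)

est = SymbolicRegressor(
    population_size=1000,         # candidate equations per generation
    generations=20,               # cycles of selection, cross-over, and mutation
    function_set=("add", "sub", "mul", "div"),
    parsimony_coefficient=0.001,  # penalize overly long "chromosomes"
    random_state=2,
)
est.fit(X[:300], reading[:300])            # training sample
print(est._program)                        # best evolved equation
print(est.score(X[300:], reading[300:]))   # hold-out (cross-validation) R^2
```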

Conclusion

In sum, Volumes I and II present 23 fundamental and advanced quantitative methods and their applications in language testing research. An important factor to consider in choosing among these fundamental and advanced methods is the role of theory and the nature of the research questions. Although some may be drawn to advanced methods because they might provide stronger evidence to support validity and reliability claims, in some cases less complex methods might better serve researchers' needs. Nevertheless, oversimplifying research problems could result in overlooking significant sources of variation in the data and drawing possibly wrong or naïve inferences. The authors of the chapters have therefore emphasized that the first step in choosing a method is the postulation of a theoretical framework that specifies the nature of the relationships among the variables, processes, and mechanisms of the attributes under investigation. Only after establishing the theoretical framework should one proceed to select quantitative methods to test the hypotheses of the study. To this end, the chapters in the volumes provide step-by-step guidelines to achieve accuracy and precision in choosing and conducting the relevant quantitative techniques. We are confident that the joint effort of the authors has emphasized the research rigor required in the field and highlighted the strengths and weaknesses of the data analysis techniques.


Page 30: Quantitative Data Analysis for - ESEC

Introduction Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psycho-metrics.Cambridge: Cambridge University Press. Chapelle, C. A. , Enright, M. K. , & Jamieson, J. M. (Eds.). (2008). Building a validity argumentfor the test of English as a foreign language. New York, NY: Routledge.

Item analysis in language assessment Alderson, J. C. (2007). Final report on the ELPAC validation study. Retrieved fromwww.elpac.info/ Alderson, J. C. (2010). A survey of aviation English tests. Language Testing, 27(1), 51���72. Alderson, J. C. , & Huhta, A. (2005). The development of a suite of computer-based diagnostictests based on the Common European Framework. Language Testing, 22(3), 301���320. Alderson, J. C. , Percsich, R. , & Szabo, G. (2000). Sequencing as an item type. LanguageTesting, 17(4), 423���447. Anderson, N. J. , Bachman, L. , Perkins, K. , & Cohen, A. (1991). An exploratory study into theconstruct validity of a reading comprehension test: Triangulation of data sources. LanguageTesting, 8(1), 41���66. Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: CambridgeUniversity Press. Bhumichitr, D. , Gardner, D. , & Green, R. (2013, July). Developing a test for diplomats:Challenges, impact and accountability. Paper presented at the Language Testing ResearchColloquium, Seoul, South Korea. Campfield, D. E. (2017). Lexical difficulty: Using elicited imitation to study child L2. LanguageTesting, 34(2), 197���221. Cizek, G. J. , & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluatingperformance standards on tests. Thousand Oaks, CA: SAGE Publications. Culligan, B. (2015). A comparison of three test formats to assess word difficulty. LanguageTesting, 32(4), 503���520. Currie, M. , & Chiramanee, T. (2010). The effect of the multiple-choice item format on themeasurement of knowledge of language structure. Language Testing, 27(4), 471���491. Field, A. (2009). Discovering statistics using SPSS. London: SAGE Publications. Fortune, A. (2004). Testing listening comprehension in a foreign language: Does the number oftimes a text is heard affect performance. Unpublished MA dissertation, Lancaster University,United Kingdom. Green, R. (2005). English Language Proficiency for Aeronautical Communication ��� ELPAC.Paper presented at the Language Testing Forum (LTF), University of Cambridge. 29 Green, R. (2013). Statistical analyses for language testers. New York, NY: PalgraveMacmillan. Green, R. (2017). Designing listening tests: A practical approach. London: Palgrave Macmillan. Green, R. , & Spoettl, C. (2009, June). Going national, standardized and live in Austria:Challenges and tensions. Paper presented at the EALTA Conference, Turku, Finland. Retrievedfrom www.ealta.eu.org/conference/2009/programme.htm Green, R. , & Spoettl, C. (2011, May). Building up a pool of standard setting judges: Problems,solutions and insights. Paper presented at the EALTA Conference, Siena, Italy. Retrieved fromwww.ealta.eu.org/conference/2011/programme.html Henning, G. (1987). A guide to language testing: Development, evaluation, research.Cambridge: Newbury House Publishers. Hsu, T. H. L. (2016). Removing bias towards World Englishes: The development of a raterattitude instrument using Indian English as a stimulus. Language Testing, 33(3), 367���389. Ilc, G. , & Stopar, A. (2015). Validating the Slovenian national alignment to CEFR: The case ofthe B2 reading comprehension examination in English. Language Testing, 32(4), 443���462. Jafarpur, A. (2003). Is the test constructor a facet? Language Testing, 20(1), 57���87. Jang, E. E. , Dunlop, M. , Park, G. , & van der Boom, E. H. (2015). How do young students withdifferent profiles of reading skill mastery, perceived ability, and goal orientation respond toholistic diagnostic feedback? 
Language Testing, 32(3), 359���383. Kobayashi, M. (2002). Method effects on reading comprehension test performance: Textorganization and response format. Language Testing, 19(2), 193���220. LaFlair, G. T. , Isbell, D. , May, L. N. , Gutierrez Arvizu, M. N. , & Jamieson, J. (2015). Equatingin small-scale language testing programs. Language Testing, 34(1), 127���144.

Page 31: Quantitative Data Analysis for - ESEC

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill. Pallant, J. (2007). SPSS survival manual: A step by step guide to data analysis using SPSS forWindows (3rd ed.). New York, NY: Open University Press. Papageorgiou, S. (2016). Aligning language assessments to standards and frameworks. In D.Tsagari & J. Banerjee (Eds.), Handbook of second language assessment (pp. 327���340). Berlin:Walter de Gruyter Inc. Popham, W. J. (2000). Modern educational measurement: Practical guidelines for educationalleaders. Boston, MA: Allyn & Bacon. Salkind, N. J. (2006). Tests & measurement for people who (think they) hate tests &measurement. Thousand Oaks, CA: SAGE Publications. Sarig, G. (1989). Testing meaning construction: Can we do it fairly? Language Testing, 6(1),77���94. Shizuka, T. , Takeuchi, O. , Yashima, T. , & Yoshizawa, K. (2006). A comparison of three-andfour-option English tests for university entrance selection purposes in Japan. Language Testing,23(1), 35���57. Wagner, E. (2008). Video listening tests: What are they measuring? Language AssessmentQuarterly, 5(3), 218���243. Wagner, E. (2010). The effect of the use of video texts on ESL listening test-taker performance.Language Testing, 27(4), 493���513. Zieky, M. J. , Perie, M. , & Livingston, S. A. (2008). Cut-scores: A manual for setting standardsof performance on educational and occupational tests. Princeton, NJ: Educational TestingService.

Univariate generalizability theory in language assessment Atilgan, H. (2013). Sample size for estimation of g and phi coefficients in generaliz-ability theory.Eurasian Journal of Educational Research, 51, 215���228. Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: CambridgeUniversity Press. Bachman, L. F. , Lynch, B. K. , & Mason, M. (1995). Investigating variability in tasks and raterjudgment in a performance test of foreign language speaking. Language Testing, 12(2),238���257. Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American CollegeTesting. Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer-Verlag, Inc. Brennan, R. L. , Gao, X. , & Colton, D. A. (1995). Generalizability analyses of Work Keyslistening and writing tests. Educational and Psychological Measurement, 55(2), 157���176. Brown, J. D. (1999). The relative importance of persons, items, subtests and languages toTOEFL test variance. Language Testing, 16(2), 217���238. Brown, J. D. , & Bailey, K. M. (1984). A categorical instrument for scoring second languagewriting skills. Language Learning, 34(4), 21���38. Brown, J. D. , & Hudson, T. (2002). Criterion-referenced language testing. Cambridge:Cambridge University Press. Cardinet, J. , Johnson, S. , & Pini, G. (2009). Applying generalizability theory using EduG. NewYork, NY: Routledge. Chiu, C. W. T. (2001). Scoring performance assessments based on judgments: Generaliz-abilitytheory. Boston, MA: Kluwer Academic. Chiu, C. W. T. , & Wolfe, E. W. (2002). A method for analyzing sparse data matrices in thegeneralizability theory framework. Applied Psychological Measurement, 26(3), 321���338. Crick, J. E. , & Brennan, R. L. (2001). GENOVA (Version 3.1) [Computer program]. Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16,297���334. Cronbach, L. J. , Gleser, G. C. , Nanda, H. , & Rajaratnam, N. (1972). The dependability ofbehavioral measurements: Theory of generalizability for scores and profiles. New York, NY:Wiley. Hoyt, W. T. (2010). Inter-rater reliability and agreement. In G. R. Hancock & R. O. Mueller(Eds.), The reviewer���s guide to quantitative methods in the social sciences (pp. 141���154). NewYork, NY: Routledge. In���nami, Y. , & Koizumi, R. (2016). Task and rater effects in L2 speaking and writing: A synthesisof generalizability studies. Language Testing, 33(3), 341���366.

Page 32: Quantitative Data Analysis for - ESEC

Lee, Y.-W. (2006). Dependability of scores for a new ESL speaking assessment consisting ofintegrated and independent tasks. Language Testing, 23(2), 131���166. Lee, Y.-W. , & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluatingprototype tasks and alternative rating schemes. TOEFL Monograph Series No. 31. Princeton,NJ: Educational Testing Service. 53 Lin, C.-K. (2017). Working with sparse data in rated language tests: Generalizability theoryapplications. Language Testing, 34(2), 271���289. Lynch, B. K. , & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurementin the development of performance of ESL speaking skills of immigrants. Language Testing,15(2), 158���180. Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill. Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment:Reporting a score profile. Language Testing, 24(3), 355���390. Sawaki, Y. (2013). Classical test theory. In A. Kunnan (Ed.), The companion to languageassessment (pp. 1147���1164). New York, NY: Wiley. Sawaki, Y. , & Sinharay, S. (2013). The value of reporting TOEFL iBT subscores. TOEFL iBTResearch Report No. TOEFLiBT-21. Princeton, NJ: Educational Testing Service. Schoonen, R. (2005). Generalizability of writing scores: An application of structural equationmodeling. Language Testing, 22(1), 1���30. Shavelson, R. J. , & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA:SAGE Publications. Xi, X. (2007). Evaluating analytic scores for the TOEFL�� Academic Speaking Test (TAST) foroperational use. Language Testing, 24(2), 251���286. Xi, X. , & Mollaun, P. (2011). Using raters from India to score a large-scale speaking test.Language Learning, 61(4), 1222���1255. Zhang, S. (2006). Investigating the relative effects of persons, items, sections, and languageson TOEIC score dependability. Language Testing, 23(3), 351���369.

Multivariate generalizability theory in language assessment Alderson, J. C. (1981). Report of the discussion on the testing of English for specific purposes.In J. C. Alderson & A. Hughes (Eds.), Issues in language testing. ELT Documents No. 111 (pp.187���194). London: British Council. Atilgan, H. (2013). Sample size for estimation of G and phi coefficients in generaliz-abilitytheory. Eurasian Journal of Educational Research, 51, 215���227. Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: OxfordUniversity Press. Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: CambridgeUniversity Press. Bachman, L. F. (2005). Building and supporting a case for test use. Language AssessmentQuarterly, 2, 1���34. Bachman, L. F. , Lynch, B. K. , & Mason, M. (1995). Investigating variability in tasks and raterjudgment in a performance test of foreign language speaking. Language Testing, 12(2),238���257. Bachman, L. F. , & Palmer, A. (1982). The construct validation of some components ofcommunicative proficiency. TESOL Quarterly, 16, 446���465. Bachman, L. F. , & Palmer, A. (1996). Language testing in practice. Oxford: Oxford UniversityPress. Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: ACT, Inc. Brennan, R. L. (1992). Generalizability theory. Educational Measurement: Issues and Practice,11(4), 27���34. Brennan, R. L. (2001a). Generalizability theory. New York, NY: Springer-Verlag. Brennan, R. L. (2001b). mGENOVA (Version 2.1) [Computer software]. Iowa City, IA: TheUniversity of Iowa. Retrieved from https://education.uiowa.edu/centers/center-advanced-studies-measurement-and-assessment/computer-programs Brennan, R. L. (2011). Generalizability theory and classical test theory. Applied Measurement inEducation, 24(1), 1���21. doi: 10.1080/08957347.2011.532417 Brennan, R. L. , Gao, X. , & Colton, D. A. (1995). Generalizability analyses of work keyslistening and writing tests. Educational and Psychological Measurement, 55(2), 157���176.

Page 33: Quantitative Data Analysis for - ESEC

79 Chiu, C. , & Wolfe, E. (2002). A method for analyzing sparse data matrices in thegeneralizability theory framework. Applied Psychological Measurement, 26(3), 321���338. Cronbach, L. J. , Gleser, G. C. , Nanda, H. , & Rajaratnam, N. (1972). The dependability ofbehavioral measurement: Theory of generalizability for scores and profiles. New York, NY:Wiley. Davies, A. , Brown, A. , Elder, E. , Hill, K. , Lumley, T. , & McNamara, T. (1999). Dictionary oflanguage testing. Studies in Language Testing, 7. Cambridge: Cambridge University Press. Frost, K. , Elder, C. , & Wigglesworth, G. (2011). Investigating the validity of an integratedlistening-speaking task: A discourse-based analysis of test takers��� oral performances. LanguageTesting, 29(3), 345���369. Grabowski, K. C. (2009). Investigating the construct validity of a test designed to measuregrammatical and pragmatic knowledge in the context of speaking. Unpublished doctoraldissertation, Teachers College, Columbia University. In���nami, Y. , & Koizumi, R. (2015). Task and rater effects in L2 speaking and writing: A synthesisof generalizability studies. Language Testing, 33(3), 341���366. Lee, Y.-W. (2006). Dependability of scores for a new ESL speaking assessment consisting ofintegrated and independent tasks. Language Testing, 23(2), 131���166. Lee, Y.-W. , & Kantor, R. (2007). Evaluating prototype tasks and alternative rating schemes fora new ESL writing test through g-theory. International Journal of Testing, 7(4), 353���385. Liao, Y.-F. (2016). Investigating the score dependability and decision dependability of the GEPTlistening test: A multivariate generalizability theory approach. English Teaching and Learning,40(1), 79���111. Lin, R. (2017, June). Operationalizing content integration in analytic scoring: Assessinglistening-speaking ability in a scenario-based assessment. Paper presented at the 4th AnnualInternational Conference of the Asian Association for Language Assessment (AALA), Taipei. Linacre, M. (2001). Generalizability theory and Rasch measurement. Rasch MeasurementTransactions, 15(1), 806���807. Lynch, B. K. , & McNamara, T. (1998). Using G-theory and many-facet Rasch measurement inthe development of performance assessments of the ESL speaking skills of migrants. LanguageTesting, 15(2), 158���180. McNamara, T. (1996). Measuring second language test performance. New York, NY: Longman. Plakans, L. (2013). Assessment of integrated skills. In C. A. Chapelle (Ed.), The encyclopedia ofapplied linguistics (pp. 205���212). Malden, MA: Blackwell. Sato, T. (2011). The contribution of test-takers��� speech content to scores on an English oralproficiency test. Language Testing, 29(2), 223���241. Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment:Reporting a score profile and a composite. Language Testing, 24(3), 355���390. Shavelson, R. , & Webb, N. (1991). Generalizability theory: A primer. Newbury Park, CA: SAGEPublications. Wang, M. W. , & Stanley, J. C. (1970). Differential weighting: A review of methods and empiricalstudies. Review of Educational Research, 40(5), 663���705. Webb, N. M. , Shavelson, R. J. , & Maddahian, E. (1983). Multivariate generalizability theory. InL. J. Fyans (Ed.), New directions in testing and measurement: Generaliz-ability theory (pp.67���82). San Francisco, CA: Jossey-Bass. 80 Weigle, S. (2004). Integrating reading and writing in a competency test for non-nativespeakers of English. Assessing Writing, 9, 27���55. Xi, X. (2007). 
Evaluating analytic scoring for the TOEFL�� Academic Speaking Test (TAST) foroperational use. Language Testing, 24(2), 251���286. Xi, X. , & Mollaun, P. (2014). Investigating the utility of analytic scoring for the TOEFL AcademicSpeaking Test (TAST). ETS Research Report Series, 2006(1), 1���71. Zhang, S. (2006). Investigating the relative effects of persons, items, sections, and language onTOEIC score dependability. Language Testing, 23(3), 351���369.

Applying Rasch measurement in language assessment Aryadoust, V. , Goh, C. C. , & Kim, L. O. (2011). An investigation of differential item functioningin the MELAB listening test. Language Assessment Quarterly, 8(4), 361���385. Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that whatwe count counts. Language Testing, 17(1), 1���42. Bachman, L. F. , & Palmer, A. S. (1996). Language assessment in practice: Designing anddeveloping useful language tests. Oxford: Oxford University Press.

Page 34: Quantitative Data Analysis for - ESEC

Bachman, L. F. , & Palmer, A. S. (2010). Language assessment in practice: Developinglanguage assessments and justifying their use in the real world. Oxford: Oxford UniversityPress. Beglar, D. (2010). A Rasch-based validation of the Vocabulary Size Test. Language Testing,27(1), 101���118. Bejar, I. I. (1983). Achievement testing: Recent advances. Beverly Hills, CA: SAGEPublications. Bond, T. , & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in thehuman sciences. New York, NY: Routledge. Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press. Chapelle, C. A. , Enright, M. K. , & Jamieson, J. M. (2008). Building a validity argument for thetest of English as a foreign language. New York, NY and London: Routledge, Taylor & FrancisGroup. Chen, W.-H. , & Thissen, D. (1997). Local dependence indexes for item pairs using itemresponse theory. Journal of Educational and Behavioral Statistics, 22(3), 265���289. Chou, Y. T. , & Wang, W. C. (2010). Checking dimensionality in item response models withprincipal component analysis on standardized residuals. Educational and PsychologicalMeasurement, 70(5), 717���731. Christensen, K. B. , Makransky, G. , & Horton, M. (2017). Critical values for Yen���s Q 3:Identification of local dependence in the Rasch model using residual correlations. AppliedPsychological Measurement, 41(3), 178���194. Eckes, T. (2011). Introduction to many-facet Rasch measurement. Frankfurt am Main: PeterLang. Fan, J. , & Ji, P. (2014). Test candidates��� attitudes and their test performance: The case of theFudan English Test. University of Sydney Papers in TESOL, 9, 1���35. Fan, J. , Ji, P. , & Song, X. (2014). Washback of university-based English language tests onstudents��� learning: A case study. The Asian Journal of Applied Linguistics, 1(2), 178���192. 101 Fan, J. , & Yan, X. (2017). From test performance to language use: Using self-assessmentto validate a high-stakes English proficiency test. The Asia-Pacific Education Researcher,26(1���2), 61���73. FDU Testing Team . (2014). The Fudan English test syllabus. Shanghai: Fudan UniversityPress. Ferguson, G. A. (1941). The factorial interpretation of test difficulty. Psychometrika, 6(5),323���330. Ferne, T. , & Rupp, A. A. (2007). A synthesis of 15 years of research on DIF in languagetesting: Methodological advances, challenges, and recommendations. Language AssessmentQuarterly, 4(2), 113���148. Field, A. (2009). Discover statistics using SPSS. London: SAGE Publications. Hamp-Lyons, L. (1989). Applying the partial credit method of Rasch analysis: Language testingand accountability. Language Testing, 6(1), 109���118. Linacre, J. M. (1998). Detecting multidimensionality: Which residual data-type works best?Journal of Outcome Measurement, 2, 266���283. Linacre, J. M. (2017a). Winsteps�� (Version 3.93.0) [Computer software]. Beaverton, OR:Winsteps.com. Retrieved January 1, 2017 from www.winsteps.com. Linacre, J. M. (2017b). Facets computer program for many-facet Rasch measurement, version3.80.0 user���s guide. Beaverton, OR: Winsteps.com. Linacre, J. M. (2017c). Winsteps �� Rasch measurement computer program user���s guide.Beaverton, OR: Winsteps.com. Linacre, J. M. (May, 2017). Personal Communication. Marais, I. (2013). Local dependence. In K. B. Christensen , S. Kreiner , & M. Mesbah (Eds.),Rasch models in health (pp. 111���130). London: John Wiley & Sons Ltd. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149���174. McNamara, T. (1996). 
Measuring second language proficiency. London: Longman PublishingGroup. McNamara, T. , & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurementin language testing. Language Testing, 29(4), 553���574. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13���103).New York, NY: American Council on Education/Macmillan Publishing Company. Min, S. , & He, L. (2014). Applying unidimensional and multidimensional item response theorymodels in testlet-based reading assessment. Language Testing, 31(4), 453���477. Pae, H. K. , Greenberg, D. , & Morris, R. D. (2012). Construct validity and measurementinvariance of the Peabody Picture Vocabulary Test ��� III Form A. Language AssessmentQuarterly, 9(2), 152���171.

Page 35: Quantitative Data Analysis for - ESEC

Sawaki, Y. , Stricker, L. J. , & Oranje, A. H. (2009). Factor structure of the TOEFL Internet-based test. Language Testing, 26(1), 5���30. Skehan, P. (1989). Language testing part II. Language Teaching, 22(1), 1���13. Stevens, J. (2002). Applied multivariate statistics for the social sciences. Mahwah, NJ:Lawrence Erlbaum Associates, Inc. Taylor, L. , & Geranpayeh, A. (2011). Assessing listening for academic purposes: Defining andoperationalising the test construct. Journal of English for Academic Purposes, 10(2), 89���101. Thissen, D. , Steinberg, L. , & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple-categorical-response models. Journal of Educational Measurement, 26(3), 247���260. 102 Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance ofthe three-parameter logistic model. Applied Psychological Measurement, 8(2), 125���145. Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local itemdependence. Journal of Educational Measurement, 30(3), 187���213. Zhang, B. (2010). Assessing the accuracy and consistency of language proficiencyclassification under competing measurement models. Language Testing, 27(1), 119���140.

The Rasch measurement approach to differential item functioning (DIF)analysis in language assessment research Abbott, M. L. (2007). A confirmatory approach to differential item functioning on an ESL readingassessment. Language Testing, 24(1), 7���36. doi: 10.1177/0265532207071510 Allalouf, A. (2003). Revising translated differential item functioning items as a tool for improvingcross-lingual assessment. Applied Measurement in Education, 16(1), 55���73. doi:10.1207/S15324818AME1601_3 Allalouf, A. , & Abramzon, A. (2008). Constructing better second language assessments basedon differential item functioning analysis. Language Assessment Quarterly, 5(2), 120���141. doi:10.1080/15434300801934710 Allalouf, A. , Hambleton, R. K. , & Sireci, S. G. (1999). Identifying the causes of DIF in translatedverbal items. Journal of Educational Measurement, 36(3), 185���198. Aryadoust, V. (2012). Differential item functioning in while-listening performance tests: The caseof the International English Language Testing System (IELTS) listening module. InternationalJournal of Listening, 26(1), 40���60. doi: 10.1080/10904018.2012.639649 Aryadoust, V. , Goh, C. C. M. , & Kim, L. O. (2011). An investigation of differential itemfunctioning in the MELAB listening test. Language Assessment Quarterly, 8(4), 361���385. doi:10.1080/15434303.2011.628632 Bachman, L. F. (2005). Building and supporting a case for test use. Language AssessmentQuarterly, 2(1), 1���34. doi: 10.1207/s15434311laq0201_1 Bachman, L. F. , & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford UniversityPress. Bae, J. , & Bachman, L. F. (1998). A latent variable approach to listening and reading: Testingfactorial invariance across two groups of children in the Korean/English two-way immersionprogram. Language Testing, 15(3), 380���414. doi: 10.1177/026553229801500304 Banerjee, J. , & Papageorgiou, S. (2016). What���s in a topic? Exploring the interaction betweentest-taker age and item content in high-stakes testing. International Journal of Listening, 30(1���2),8���24. doi: 10.1080/10904018.2015.1056876 Bauer, D. J. (2016). A more general model for testing measurement invariance and differentialitem functioning. Psychological Methods. doi: 10.1037/met0000077 126 Bollmann, S. , Berger, M. , & Tutz, G. (2017). Item-focused trees for the detection ofdifferential item functioning in partial credit models. Educational and PsychologicalMeasurement, 78(5), 781���804. doi: 10.1177/0013164417722179 Bond, T. G. , & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in thehuman sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates. Borsboom, D. , Mellenbergh, G. J. , & Van Heerden, J. (2002). Different kinds of DIF: Adistinction between absolute and relative forms of measurement invariance and bias. AppliedPsychological Measurement, 26(4), 433���450. doi: 10.1177/014662102237798 Bray, M. , Butler, R. , Hui, P. , Kwo, O. , & Mang, E. (2002). Higher education in Macau: Growthand strategic development. Hong Kong: Comparative Education Research Centre (CERC), TheUniversity of Hong Kong. Bray, M. , & Koo, R. (2004). Postcolonial patterns and paradoxes: Language and education inHong Kong and Macao. Comparative Education, 40(2), 215���239.

Page 36: Quantitative Data Analysis for - ESEC

Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press. Bulut, O. , Quo, Q. , & Gierl, M. J. (2017). A structural equation modeling approach forexamining position effects in large-scale assessments. Large-Scale Assessments in Education,5(1), 8. doi: 10.1186/s40536-017-0042-x Carlson, J. E. , & von Davier, M. (2017). Item response theory. In R. E. Bennett & M. vonDavier(Eds.), Advancing human assessment (pp. 133���160). Cham, Switzerland: Springer Open. doi:10.1007/978-3-319-58689-2 Cheng, Y. , Shao, C. , & Lathrop, Q. N. (2015). The mediated MIMIC model for understandingthe underlying mechanism of DIF. Educational and Psychological Measurement, 76(1), 43���63.doi: 10.1177/0013164415576187 Dorans, N. J. (2017). Contributions to the quantitative assessment of item, test, and scorefairness. In R. E. Bennett & M. vonDavier (Eds.), Advancing human assessment (pp. 204���230).Cham, Switzerland: Springer Open. doi: 10.1007/978-3-319-58689-2 Elder, C. (1996). The effect of language background on ���Foreign��� language test performance:The case of Chinese, Italian, and modern Greek. Language Learning, 46(2), 233���282. doi:10.1111/j.1467-1770.1996.tb01236.x Elder, C. , McNamara, T. , & Congdon, P. (2003). Rasch techniques for detecting bias inperformance assessments: An example comparing the performance of native and non-nativespeakers on a test of academic English. Journal of Applied Measurement, 4(2), 181���197. Engelhard Jr., G. (2013). Invariant measurement: Using Rasch models in the social, behavioraland health sciences. New York, NY: Routledge. Engelhard Jr., G. , Kobrin, J. L. , & Wind, S. A. (2014). Exploring differential subgroupfunctioning on SAT writing items: What happens when English is not a test taker���s bestlanguage? International Journal of Testing, 14(4), 339���359. doi: 10.1080/15305058.2014.931281 Evans, S. , & Morrison, B. (2012). Learning and using English at university: Lessons from alongitudinal study in Hong Kong. The Journal of Asia TEFL, 9(2), 21���47. Ferne, T. , & Rupp, A. A. (2007). A synthesis of 15 years of research on DIF in languagetesting: Methodological advances, challenges, and recommendations. Language AssessmentQuarterly, 4(2), 113���148. doi: 10.1080/15434300701375923 Field, J. (1998). Skills and strategies: Towards a new methodology for listening. ELT Journal,52(2), 110���118. Field, J. (2005). Intelligibility and the listener: The role of lexical stress. TESOL Quarterly, 39(3),399���423. 127 Filipi, A. (2012). Do questions written in the target language make foreign languagelistening comprehension tests more difficult? Language Testing, 29(4), 511���532. doi:10.1177/0265532212441329 Freedle, R. , & Kostin, I. (1997). Predicting black and white differential item functioning in verbalanalogy performance. Intelligence, 24(3), 417���444. doi: 10.1016/S0160-2896(97)90058-1 Geranpayeh, A. , & Kunnan, A. J. (2007). Differential item functioning in terms of age in theCertificate in Advanced English examination. Language Assessment Quarterly, 4(2), 190���222.doi: 10.1080/15434300701375758 Gierl, M. J. , & Khaliq, S. N. (2006). Identifying sources of differential item and bundlefunctioning on translated achievement tests: A confirmatory analysis. Journal of EducationalMeasurement, 38(2), 164���187. doi: 10.1111/j.1745-3984.2001.tb01121.x Gnaldi, M. , & Bacci, S. (2016). Joint assessment of the latent trait dimensionality and observeddifferential item functioning of students��� national tests. 
Quality and Quantity, 50(4), 1429���1447.doi: 10.1007/s11135-015-0214-0 Harding, L. (2012). Accent, listening assessment and the potential for a shared-L1 advantage: ADIF perspective. Language Testing, 29(2), 163���180. doi: 10.1177/0265532211421161 Hidalgo, M. D. , & G��mez-Benito, J. (2010). Education measurement: Differential itemfunctioning. In P. Peterson , E. Baker , & B. McGaw (Eds.), International encyclopedia ofeducation (3rd ed., Vol. 4, pp. 36���44). Oxford, UK: Elsevier. Holland, P. W. , & Thayer, D. T. (1988). Differential item performance and the Mantel��� Haenszelprocedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 76���86). Hillsdale, NJ: Routledge. Huang, X. (2010). Differential item functioning: The consequence of language, curriculum, orculture? PhD dissertation, University of California. Huang, X. , Wilson, M. , & Wang, L. (2016). Exploring plausible causes of differential itemfunctioning in the PISA science assessment: Language, curriculum or culture. EducationalPsychology, 36(2), 378���390. doi: 10.1080/01443410.2014.946890 Jang, E. E. , & Roussos, L. (2009). Integrative analytic approach to detecting and interpretingL2 vocabulary DIF. International Journal of Testing, 9(3), 238���259. doi:10.1080/15305050903107022 Karami, H. , & Salmani Nodoushan, M. A. (2011). Differential item functioning: Current problemsand future directions. International Journal of Language Studies, 5(3), 133���142.

Page 37: Quantitative Data Analysis for - ESEC

Koo, J. , Becker, B. J. , & Kim, Y.-S. (2014). Examining differential item functioning trends forEnglish language learners in a reading test: A meta-analytical approach. Language Testing,31(1), 89���109. doi: 10.1177/0265532213496097 Kunnan, A. J. (Ed.). (2000). Fairness and validation in language assessment: Selected papersfrom the 19th Language Testing Research Colloquium, Orlando, Florida. Cambridge:Cambridge University Press. Kunnan, A. J. (2007). Test fairness, test bias, and DIF. Language Assessment Quarterly, 4(2),109���112. doi: 10.1080/15434300701375865 Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? RaschMeasurement Transactions, 16(2), 878. Linacre, J. M. (2003). PCA: Data variance: Explained, modeled and empirical. RaschMeasurement Transactions, 13(3), 942���943. Linacre, J. M. (2018a). DIF-DPF-bias-interactions concepts. Winsteps Help for Rasch Analysis.Retrieved from www.winsteps.com/winman/difconcepts.htm 128 Linacre, J. M. (2018b). Dimensionality: Contrasts & variances. Winsteps Help for RaschAnalysis. Retrieved from www.winsteps.com/winman/principalcomponents.htm Linacre, J. M. (2018c). Winsteps�� (Version 4.2.0) [Computer software]. Beaverton, OR:Winsteps.com . Retrieved from www.winsteps.com Linacre, J. M. , & Wright, B. D. (1987). Item bias: Mantel-Haenszel and the Rasch model.Memorandum No. 39. Retrieved from MESA Psychometric Laboratory, University of Chicago:www.rasch.org/memo39.pdf Linacre, J. M. , & Wright, B. D. (1989). Mantel-Haenszel DIF and PROX are equivalent! RaschMeasurement Transactions, 3(2), 52���53. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale,NJ: Lawrence Erlbaum Associates. Lund, R. J. (1990). A taxonomy for teaching second language listening. Foreign LanguageAnnals, 23(2), 105���115. Luppescu, S. (1993). DIF detection examined: Which item has the real differential itemfunctioning? Rasch Measurement Transactions, 7(2), 285���286. Magis, D. , Raiche, G. , Beland, S. , & Gerard, P. (2011). A generalized logistic regressionprocedure to detect differential item functioning among multiple groups. International Journal ofTesting, 11(4), 365���386. doi: 10.1080/15305058.2011.602810 Magis, D. , Tuerlinckx, F. , & De Boeck, P. (2015). Detection of differential item functioningusing the lasso approach. Journal of Educational and Behavioral Statistics, 40(2), 111���135. doi:10.3102/1076998614559747 Mantel, N. , & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospectivestudies of disease. JNCI: Journal of the National Cancer Institute, 22(4), 719���748. doi:10.1093/jnci/22.4.719 Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal ofEducational Statistics, 7(2), 105���118. doi: 10.2307/1164960 Mellenbergh, G. J. (2005a). Item bias detection: Classical approaches. In B. S. Everitt & D. C.Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 2, pp. 967���970). WestSussex: Wiley. Mellenbergh, G. J. (2005b). Item bias detection: Modern approaches. In B. S. Everitt & D. C.Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 2, pp. 970���974). WestSussex: Wiley. Millsap, R. E. , & Everson, H. T. (1993). Methodology review: Statistical approaches forassessing measurement bias. Applied Psychological Measurement, 17(4), 297���334. doi:10.1177/014662169301700401 Mullis, I. V. S. , Martin, M. O. , Kennedy, A. M. , & Foy, P. (2007). 
PIRLS 2006 InternationalReport: IEA���s Progress in International Reading Literacy Study in primary schools in 40countries. Retrieved from https://timss.bc.edu/PDF/PIRLS2006_international_report.pdf Nandakumar, R. (1993). Simultaneous DIF amplification and cancellation: Shealy-Stout���s test forDIF. Journal of Educational Measurement, 30(4), 293���311. Oliveri, M. E. , Ercikan, K. , Lyons-Thomas, J. , & Holtzman, S. (2016). Analyzing fairnessamong linguistic minority populations using a latent class differential item functioning approach.Applied Measurement in Education, 29(1), 17���29. doi: 10.1080/08957347.2015.1102913 Oliveri, M. E. , Lawless, R. , Robin, F. , & Bridgeman, B. (2018). An exploratory analysis ofdifferential item functioning and its possible sources in a higher education 129admissionscontext. Applied Measurement in Education, 31(1), 1���16. doi: 10.1080/08957347.2017.1391258 Ownby, R. L. , & Waldrop-Valverde, D. (2013). Differential item functioning related to age in thereading subtest of the Test of Functional Health Literacy in adults. Journal of Aging Research,2013, 6. doi: 10.1155/2013/654589

Page 38: Quantitative Data Analysis for - ESEC

Pae, T.-I. (2004). DIF for examinees with different academic backgrounds. Language Testing,21(1), 53���73. doi: 10.1191/0265532204lt274oa Pae, T.-I. (2012). Causes of gender DIF on an EFL language test: A multiple-data analysis overnine years. Language Testing, 29(4), 533���554. doi: 10.1177/0265532211434027 Park, G.-P. (2008). Differential item functioning on an English listening test across gender.TESOL Quarterly, 42(1), 115���123. doi: 10.1002/j.1545-7249.2008.tb00212.x Roussos, L. , & Stout, W. (1996). A multidimensionality-based DIF analysis paradigm. AppliedPsychological Measurement, 20(4), 355���371. doi: 10.1177/014662169602000404 Roussos, L. , & Stout, W. (2004). Differential item functioning analysis: Detecting DIF items andtesting DIF hypotheses. In D. Kaplan (Ed.), The SAGE handbook of quantitative methodologyfor the social sciences (pp. 107���115). Thousand Oaks, CA: SAGE Publications. doi:10.4135/9781412986311 Runnels, J. (2013). Measuring differential item and test functioning across academic disciplines.Language Testing in Asia, 3(1), 1���11. doi: 10.1186/2229-0443-3-9 Salubayba, T. (2013). Differential item functioning detection in reading comprehension testusing Mantel-Haenszel, item response theory, and logical data analysis. The InternationalJournal of Social Sciences, 14 (1), 76���82. Salzberger, T. , Newton, F. J. , & Ewing, M. T. (2014). Detecting gender item bias anddifferential manifest response behavior: A Rasch-based solution. Journal of Business Research,67(4), 598���607. doi: 10.1016/j.jbusres.2013.02.045 Sandilands, D. , Oliveri, M. E. , Zumbo, B. D. , & Ercikan, K. (2013). Investigating sources ofdifferential item functioning in international large-scale assessments using a confirmatoryapproach. International Journal of Testing, 13(2), 152���174. doi: 10.1080/15305058.2012.690140 Schauberger, G. , & Tutz, G. (2015). Detection of differential item functioning in Rasch modelsby boosting techniques. British Journal of Mathematical and Statistical Psychology, 69(1),80���103. doi: 10.1111/bmsp.12060 Shealy, R. , & Stout, W. (1993a). An item response theory model for test bias. In P. W. Holland& H. Wainer (Eds.), Differential item functioning (pp. 197���239). Hillsdale, NJ: Lawrence ErlbaumAssociates. Shealy, R. , & Stout, W. (1993b). A model-based standardization approach that separates truebias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF.Psychometrika, 58(2), 159���194. doi: 10.1007/BF02294572 Shimizu, Y. , & Zumbo, B. D. (2005). A logistic regression for differential item functioning primer.Japan Language Testing Association, 7, 110���124. doi: 10.20622/jltaj.7.0_110 Shin, S.-K. (2005). Did they take the same test? Examinee language proficiency and thestructure of language tests. Language Testing, 22(1), 31���57. doi: 10.1191/0265532205lt296oa Sireci, S. G. , & Rios, J. A. (2013). Decisions that make a difference in detecting differential itemfunctioning. Educational Research and Evaluation, 19(2���3), 170���187. doi:10.1080/13803611.2013.767621 130 Smith, R. M. (1994). Detecting item bias in the Rasch rating scale model. Educational andPsychological Measurement, 54(4), 886���896. Song, M.-Y. (2008). Do divisible subskills exist in second language (L2) comprehension? Astructural equation modeling approach. Language Testing, 25(4), 435���464. doi:10.1177/0265532208094272 Swaminathan, H. , & Rogers, H. J. (1990). Detecting differential item functioning using logisticregression procedures. 
Journal of Educational Measurement, 27(4), 361���370. doi:10.1111/j.1745-3984.1990.tb00754.x Tennant, A. , & Pallant, J. (2007). DIF matters: A practical approach to test if differential itemfunctioning makes a difference. Rasch Measurement Transactions, 20(4), 1082���1084. Teresi, J. A. (2006). Different approaches to differential item functioning in health applications:Advantages, disadvantages and some neglected topics. Medical Care, 44(11), S152���S170. Thissen, D. , Steinberg, L. , & Wainer, H. (1993). Detection of differential item functioning usingthe parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential itemfunctioning (pp. 67���113). Hillsdale, NJ: Lawrence Erlbaum Associates. Uiterwijk, H. , & Vallen, T. (2005). Linguistic sources of item bias for second generationimmigrants in Dutch tests. Language Testing, 22(2), 211���234. doi: 10.1191/0265532205lt301oa University of Macau . (2016, November 18). Registered students. Retrieved fromhttps://reg.umac.mo/qfacts/y2016/student/registered-students/ Urmston, A. , Raquel, M. R. , & Tsang, C. (2013). Diagnostic testing of Hong Kong tertiarystudents��� English language proficiency: The development and validation of DELTA. Hong KongJournal of Applied Linguistics, 14(2), 60���82. Uusen, A. , & M����rsepp, M. (2012). Gender differences in reading habits among boys and girls ofbasic school in Estonia. Procedia ��� Social and Behavioral Sciences, 69, 1795���1804. doi:doi.org/10.1016/j.sbspro.2012.12.129

Page 39: Quantitative Data Analysis for - ESEC

van de Vijver, F. J. R. , & Poortinga, Y. H. (1997). Towards an integrated analysis of bias incross-cultural assessment. European Journal of Psychological Assessment, 13(1), 29���37. doi:10.1027/1015-5759.13.1.29 Wedman, J. (2018). Reasons for gender-related differential item functioning in a collegeadmissions test. Scandinavian Journal of Educational Research, 62(6), 959���970. doi:10.1080/00313831.2017.1402365 Wyse, A. (2013). DIF cancellation in the Rasch model. Journal of Applied Measurement, 14(2),118���128. Xi, X. (2010). How do we go about investigating test fairness? Language Testing, 27(2),147���170. doi: 10.1177/0265532209349465 Yan, X. (2017). The language situation in Macao. Current Issues in Language Planning, 18(1),1���38. doi: 10.1080/14664208.2016.1125594 Yan, X. , Cheng, L. , & Ginther, A. (2018). Factor analysis for fairness: Examining the impact oftask type and examinee L1 background on scores of an ITA speaking test. Language Testing,1���28. doi: 10.1177/0265532218775764 Yoo, H. , & Manna, V. F. (2015). Measuring English language workplace proficiency acrosssubgroups: Using CFA models to validate test score interpretation. Language Testing, 34(1),101���126. doi: 10.1177/0265532215618987 Yoo, H. , Manna, V. F. , Monfils, L. F. , & Oh, H.-J. (2018). Measuring English languageproficiency across subgroups: Using score equity assessment to evaluate test fairness.Language Testing. doi: 10.1177/0265532218776040 131 Young, M. Y. C. (2011). English use and education in Macao. In A. Feng (Ed.), Englishlanguage education across greater China (pp. 114���130). Bristol, UK: MultilingualMatters/Channel View Publications. Zumbo, B. D. (2007). Three generations of DIF analyses: Considering where it has been, whereit is now, and where it is going. Language Assessment Quarterly, 4(2), 223���233. doi:10.1080/15434300701375832 Zumbo, B. D. , Liu, Y. , Wu, A. D. , Shear, B. R. , Olvera Astivia, O. L. , & Ark, T. K. (2015). Amethodology for Zumbo���s third generation DIF analyses and the ecology of item responding.Language Assessment Quarterly, 12(1), 136���151. doi: 10.1080/15434303.2014.972559 Zumbo, B. D. , Liu, Y. , Wu, A. D. , Shear, B. R. , Olvera Astivia, O. L. , & Ark, T. K. (2015). Amethodology for Zumbo���s third generation DIF analyses and the ecology of item responding.Language Assessment Quarterly, 12(1), 136���151. doi: 10.1080/15434303.2014.972559

Application of the rating scale model and the partial credit model inlanguage assessment research Adams, R. J. , Griffin, P. E. , & Martin, L . (1987). A latent trait method for measuring adimension in second language proficiency. Language Testing, 4(1), 9���27. Agresti, A. (2012). Categorical data analysis (3rd ed.). Hoboken, NJ: John Wiley & Sons Inc. Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69���81. Andrich, D. (1978a). Application of a psychometric rating model to ordered categories which arescored with successive integers. Applied Psychological Measurement, 2(4), 581���594. Andrich, D. (1978b). A rating formulation for ordered response categories. Psychometrika,43(4), 561���573. Aryadoust, V. (2012). How does ���sentence structure and vocabulary��� function as a scoringcriterion alongside other criteria in writing assessment? International Journal of LanguageTesting, 2(1), 28���58. Aryadoust, V. , Goh, C. C. M. , & Kim, L. O. (2012). Developing and validating an academiclistening questionnaire. Psychological Test and Assessment Modeling, 54(3), 227���256. Bond, T. , & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in thehuman sciences (3rd ed.). New York, NY: Routledge. Cai, L. , & Hansen, M. (2013). Limited-information goodness-of-fit testing of hierarchical itemfactor models. British Journal of Mathematical and Statistical Psychology, 66(2), 245���276. Chalmers, R. P. (2012). Mirt: A multidimensional item response theory package for the Renvironment. Journal of Statistical Software, 48(6), 1���29. du Toit, M. (Ed.). (2003). IRT from SSI: Bilog-MG, multilog, parscale, testfact. Lincolnwood, IL:Scientific Software International, Inc. Eckes, T. (2012). Examinee-centered standard setting for large-scale assessments: Theprototype group method. Psychological Test and Assessment Modeling, 54(3), 257���283.

Page 40: Quantitative Data Analysis for - ESEC

Eckes, T. (2017). Setting cut scores on an EFL placement test using the prototype groupmethod: A receiver operating characteristic (ROC) analysis. Language Testing, 34(3), 383���411. Fischer, G. H. , & Molenaar, I. W. (Eds.). (1995). Rasch models: Foundations, recentdevelopments, and applications. New York, NY: Springer Science & Business Media. Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing,13(1), 23���51. 151 Glas, C. A. W. (2016). Maximum-likelihood estimation. In W. van der Linden (Ed.),Handbook of item response theory (Vol. 2, pp. 197���216). Boca Raton, FL: CRC Press. Haberman, S. J. (2006). Joint and conditional estimation for implicit models for tests withpolytomous item scores (ETS RR-06-03). Princeton, NJ: Educational Testing Service. Haberman, S. J. (2016). Models with nuisance and incidental parameters. In W. van der Linden(Ed.), Handbook of item response theory (Vol. 2, pp. 151���170). Boca Raton, FL: CRC Press. Hambleton, R. K. , & Han, N. (2005). Assessing the fit of IRT models to educational andpsychological test data: A five-step plan and several graphical displays. In R. R. Lenderking &D. A. Revicki (Eds.), Advancing health outcomes research methods and clinical applications(pp. 57���77). McLean, VA: Degnon Associates. Hambleton, R. K. , Swaminathan, H. , & Rogers, H. J. (1991). Fundamentals of item responsetheory (Vol. 2). Newbury Park, CA: SAGE Publications. Karabatsos, G. (2000). A critique of Rasch residual fit statistics. Journal of AppliedMeasurement, 1, 152���176. Kunnan, A. J. (1991). Modeling relationships among some test-taker characteristics andperformance on tests of English as a foreign language. Unpublished doctoral dissertation,University of California, LA. Lee, Y.-W. , Gentile, C. , & Kantor, R. (2008). Analytic scoring of TOEFL CBT essays: Scoresfrom humans and E-rater �� (ETS RR-08-81). Princeton, NJ: Educational Testing Service. Lee-Ellis, S. (2009). The development and validation of a Korean C-test using Rasch analysis.Language Testing, 26(2), 245���274. Linacre, J. M. (1994). Many-facet Rasch measurement (2nd ed.). Chicago, IL: MESA Press. Linacre, J. M. (2017a). Winsteps �� Rasch measurement computer program [Computer software].Beaverton, OR: Winsteps.com. Linacre, J. M. (2017b). Winsteps �� Rasch measurement computer program user���s guide .Beaverton, OR: Winsteps.com. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149���174. Maydeu-Olivares, A. , & Joe, H. (2005). Limited and full information estimation and goodness-of-fit testing in 2n contingency tables: A unified framework. Journal of the American StatisticalAssociation, 100(471), 1009���1020. McNamara, T. , & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurementin language testing. Language Testing, 29(4), 555���576. Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. AppliedPsychological Measurement, 16(2), 159���176. Neyman, J. , & Scott, E. L. (1948). Consistent estimation from partially consistent observations.Econometrica, 16, 1���32. Pollitt, A. , & Hutchinson, C. (1987). Calibrating graded assessments: Rasch partial creditanalysis of performance in writing. Language Testing, 4(1), 72���97. R Core Team . (2018). R: A language and environment for statistical computing. Vienna,Austria: R Foundation for Statistical Computing. Rose, N. , von Davier, M. , & Xu, X. (2010). Modeling nonignorable missing data with itemresponse theory (ETS RR-10-11). 
Princeton, NJ: Educational Testing Service. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores.Psychometrika, Monograph Supplement No. 17. 152 Stewart, J. , Batty, A. O. , & Bovee, N. (2012). Comparing multidimensional and continuummodels of vocabulary acquisition: An empirical examination of the vocabulary knowledge scale.TESOL Quarterly, 46(4), 695���721. Suzuki, Y. (2015). Self-assessment of Japanese as a second language: The role of experiencesin the naturalistic acquisition. Language Testing, 32(1), 63���81. Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. LanguageTesting, 29(3), 325���344. Wells, C. S. , & Hambleton, R. K. (2016). Model fit with residual analysis. In W. van der Linden(Ed.), Handbook of item response theory (Vol. 2, pp. 395���413). Boca Raton, FL: CRC Press. Wright, B. D. , & Linacre, J. M. (1994). Reasonable mean-square fit values. RaschMeasurement Transactions, 8, 370. Wright, B. D. , & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.

Page 41: Quantitative Data Analysis for - ESEC

Many-facet Rasch measurement





Analysis of differences between groups





Application of ANCOVA and MANCOVA in language assessment research





Application of linear regression in language assessment



Application of exploratory factor analysis in language assessment

