
Studies in Second Language Acquisition, 2012, 34, 99–126. doi:10.1017/S0272263111000519

© Cambridge University Press 2012

LINGUISTIC CORRELATES OF SECOND LANGUAGE PROFICIENCY

Proof of Concept with ILR 2–3 in Russian

Michael H. Long, Kira Gor, and Scott Jackson
University of Maryland

With Russian as the target language, a proof of concept study was undertaken to determine whether it is possible to identify linguistic features, control over which is implicated in progress on the Interagency Language Roundtable (ILR) proficiency scale, thereby better to inform the instructional process. Following its development in an instrumentation study, a revised version of a computer-delivered battery of 33 perception and production tasks was administered to 68 participants—57 learners between levels 2 and 3 (21 at ILR 2, 18 at 2+, and 18 at 3) on the ILR scale, and 11 native speaker controls—whose proficiency was tested via an ILR oral proficiency telephone interview. The tasks sampled subjects’ control of Russian phonology, morphology, syntax, lexis, and collocations. Relationships between control of the linguistic features and the ILR levels of interest were assessed statistically. All 33 tasks, 18 of which assessed learners’ abilities in perception and 15 of which assessed their abilities in production, were found to differentiate ILR proficiency levels 2 and 3, and a subset was found to also distinguish levels 2 and 2+, and 2+ and 3. On the basis of the results, a checklist of linguistic features pegged to proficiency levels was produced that can be useful for syllabus designers, teachers, and learners themselves, as well as providing the basis for future diagnostic tests.

The research reported was supported by funding from the University of Maryland’s Center for Advanced Study of Language and from the School of Languages, Literatures, and Cultures. For their valuable participation at various stages of the project, the authors thank Sun-Young Lee, Jennifer Koran, Tatyana Vdovina, Svetlana Cook, Vera Malyushenkova, Olena Chernishenko, and Jeffrey Witzel. Crucial cooperation was provided by Karen Evans-Romaine, Director of the Middlebury Russian Summer School; Kevin Hart, Chair of the Department of German Studies and Slavic Languages; and Grant Lundberg of Brigham Young University. Valuable comments on several dimensions of the research were provided by Catherine Doughty, Amy Weinberg, Joe Danks, Fred Eckman, and Greg Iverson.

Address correspondence to Michael Long, School of Languages, Literatures, and Cultures, 3124 Jiménez Hall, University of Maryland, College Park, MD 20742; e-mail: [email protected].



Advanced speakers of languages critical for a wide variety of important professional tasks are in high demand but short supply the world over. With increasing frequency, rapidly changing geopolitical circumstances produce unanticipated surge requirements, in less commonly taught languages (LCTLs) in particular, not to mention rarely taught or never previously taught languages. In some cases, no trained teachers or teaching materials are available, and little is known about what to teach or in what order. These problems are not limited to LCTLs. Few data exist on the appropriate linguistic content, or on psycholinguistically defensible learning sequences—especially at the advanced levels—for courses in commonly taught languages like English, Chinese, French, or Spanish. Nor do data exist on the linguistic knowledge implicated in the global ratings on the various proficiency scales now widely used to measure learners’ current abilities.

In many languages, with larger or smaller native speaker populations, commonly taught or not, there is an urgent need to develop new speakers and to raise the ability levels of existing ones. To give some idea of the problem, the typical American language and literature major is lucky to achieve level 2 on the Interagency Language Roundtable (ILR) scale after four years of study. Median attainment after four years in languages that are harder for English-speaking adults, such as Chinese, Arabic, Russian, Korean, or Japanese, is only ILR 1, with attrition typically setting in quickly thereafter. The U.S. government, however, estimates that ILR 3 is the minimum proficiency level required for most employees to be able to perform their work satisfactorily in a foreign language (Brecht & Rivers, 2000, 2005). It is clear that there is a massive language gap to be bridged, even among graduating language majors who enter private industry or government service, and the problem is compounded by the fact that most business experts and government employees are not language majors in the first place. It would be very valuable for teachers, textbook writers, testers, and the learners themselves to know precisely what is learnable, and therefore teachable, at a given stage in development. Additionally, it would be valuable to know what is required of the individual student and, more broadly, of the general student population to move from one proficiency level to another in a given language, and how the picture differs for heritage and nonheritage learners.


Important practical applications aside, the whole notion of proficiency in a second language (L2) has long been contested in the SLA and applied linguistics literature. SLA theorists, such as Pienemann (1985), have questioned its utility, arguing that it is an amorphous construct, lacking operationalization and, possibly, operationalizability, whereas learners’ current stage and subsequent progress on linguistically represented developmental indices are psycholinguistically meaningful and objectively measurable. Similar arguments have been made for more than 20 years in the language testing literature, where reliability and validity problems—with the measures employed to assess learners’ abilities and with the scales themselves—have been noted by many researchers (see Alderson, 2005, 2007; Bachman, 1988; Fulcher, 1996, 2004; Hulstijn, 2007; Lantolf & Frawley, 1985, 1988, 1992; Pienemann, 1985; Pienemann, Johnston, & Brindley, 1988). A demonstration of relationships, if any, between L2 learners’ development of control of grammatical features and the same learners’ advances in global proficiency ratings would be of interest to theorists, researchers, and practitioners alike.

The Proficiency Problem

Despite critiques by SLA theorists and well-known reliability and validity problems, global ratings of overall second language abilities, in terms of steps on various L2 proficiency scales, are used in many countries by large public and private employers, such as militaries (Lett, 2005), intelligence agencies, and corporations, as well as by academic gatekeepers. Among the best known are the ILR and ACTFL (American Council on the Teaching of Foreign Languages, 1985) scales in the United States, the Canadian Benchmarks (Pawlikowska-Smith, 2000), the International Second Language Proficiency Ratings (Wylie & Ingram, 1999) in Australia, and the Common European Framework of Reference for Languages: Learning, Teaching and Assessment (CEFR; Council of Europe, 2001) in Europe. Proficiency ratings are routinely accepted in some quarters, be it rightly or wrongly, as a rough estimate of an applicant’s ability to perform a job, as a partial basis for assigning an individual to a particular rung on a pay scale, and as a criterion used to group learners for instructional purposes or to exempt them from instruction altogether.

Although such scales are attractive to end users in part because of their functional orientation, because of their (apparent) simplicity, and because ratings—usually encoded in single phrases or global numeric values—can ostensibly be calibrated, equated, and compared, they also have disadvantages. Most scales are not empirically based but, rather, depend on the intuitions of teachers or experienced testers. Proficiency ratings found in guidelines and manuals are not based (or are minimally based) on linguistic criteria (e.g., accuracy, syntactic complexity, control of specific linguistic elements, lexical depth, or collocational appropriateness), but rather on functional criteria (e.g., the individual can linguistically handle a situation with complications or can give directions) or discourse structure (e.g., the individual speaks in paragraphs or uses extended discourse). In either case, they rely on the subjective judgments of testers, who ultimately decide whether or not a task has been successfully completed.

From a language-learning or language-teaching perspective, scales and ratings based on the scales are unhelpful. Numerical ratings, like 1+ or 2, letter grades, like B+ or A−, or opaque descriptors, like Threshold and Vantage or Advanced Low or Intermediate High, may mean something to the insiders who administer the tests, but they mean little or nothing to either the learners who need to improve their ratings or their teachers (unless they are trained testers themselves). They are proficiency scales, after all, and are not intended to provide diagnostic information.

When a student is told that he or she is a 1+ in Arabic, Chinese, Russian, or Spanish on the ILR scale, for example, and is given a certain period of time to improve, what precisely does this student and his or her teacher need to work on to reach level 2, 2+, or 3? What sorts of phonological problems, grammatical constructions, vocabulary items, collocations, or registers must be mastered at each level? And of potentially still wider interest, do the same or similar features determine ratings across languages? Do heritage and nonheritage learners exhibit comparable profiles and share the same linguistic problems within a given proficiency range? Which linguistic features does a rating imply that a learner controls, in perception or production, and which ones remain problematic? Which ones need to be included in a syllabus and pedagogic materials for learners in the 1+ to 2 or any other range in language X, in typologically similar languages, or in all languages? In short, what are the linguistic correlates of proficiency?

One potential answer to the question lies in the intuitions of experienced classroom teachers. Unfortunately, however, intuitions differ; beginning teachers are inexperienced, and experienced ones are particularly few and far between, as are pedagogic materials, when it comes to LCTLs, especially at advanced levels. Experienced and inexperienced teachers alike express a desperate need for solid information on the appropriate content of courses for the languages for which they are responsible. Impressionistic judgments, most would agree, are no more justified as a substitute for empirical findings in language teaching than in mathematics or science courses, or in the provision of other important services, from architecture to medicine.


An alternative solution might be to refer to the published descriptions that accompany global proficiency ratings. This turns out to be of limited help as well, because the descriptions are vague and involve a miscellany of overlapping linguistic criteria, nonlinguistic abilities and subabilities, functions, tasks, and skills—the linguistic correlates of which are all unknown. To compound the problem, the ability to do X or Y (hold a casual conversation, deal with routine workplace language needs, etc.) often seems to have been arbitrarily assigned to a particular level on a scale; these tasks clearly have relevance at all levels, depending on the complexity of any particular instance, and can be performed by learners at all levels, albeit with greater or lesser success. In other words, the scales’ functional correlates lack face validity.

Moreover, these scales are marked by a pervasive use of undefined terms and expressions, open to widely varying interpretations. To illustrate, consider the ILR description of speaking abilities at ILR 2.

Speaking 2 (Limited Working Proficiency) Able to satisfy routine social demands and limited work requirements. Can handle routine work-related interactions that are limited in scope. In more complex and sophisticated work-related tasks, language usage generally disturbs the native speaker. Can handle with confidence, but not with facility, most normal, high-frequency social conversational situations including extensive, but casual conversations about current events, as well as work, family, and autobiographical information. The individual can get the gist of most everyday conversations but has some difficulty understanding native speakers in situations that require specialized or sophisticated knowledge. The individual’s utterances are minimally cohesive. Linguistic structure is usually not very elaborate and not thoroughly controlled; errors are frequent. Vocabulary use is appropriate for high-frequency utterances, but unusual or imprecise elsewhere. (Interagency Language Roundtable, n.d.)1

Similar problems can be found in the descriptions for all four skills, at all 10 proficiency levels and sublevels, on the ILR scale, and in the comparable descriptions of levels of all such scales.

Additionally, the characterizations are sometimes so vague and general as to require considerable imagination on the reader’s part. For example, the description of CEFR level B1 (Threshold) reads as follows:

B1: Can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. Can deal with most situations likely to arise whilst travelling in an area where the language is spoken. Can produce simple connected text on topics which are familiar or of personal interest. Can describe experiences and events, dreams, hopes & ambitions and briefly give reasons and explanations for opinions and plans. (Council of Europe, n.d.)


“Clear standard input on familiar matters regularly encountered in work, school, leisure, etc.” and “situations likely to arise whilst travelling in an area where the language is spoken,” for example, can obviously mean very different things to different people and in practice. Some notions of linguistic competencies in terms of grammatical accuracy, phonological control, and vocabulary range and control are provided along with the CEFR proficiency scale (Council of Europe, 2001), but these are very general correlates of the CEFR levels and are not language specific.

The CEFR scale was developed somewhat differently from other proficiency scales. Functional descriptors (e.g., “can produce stretches of language with a fairly even tempo” and “can define the features of something concrete for which he/she can’t remember the word,” http://www.coe.int/lang) were culled from various existing scale descriptors and were empirically scaled by North and colleagues (North, 2000; North & Schneider, 1998) based on data from surveys and video assessments collected from nearly 300 teachers. Although an improvement over scales whose descriptors are ordered and leveled on the basis of pure intuition, the study of teachers’ judgments still does not capture the kind of detailed information on structural linguistic competencies, based on learner performance, that the Linguistic Correlates of Proficiency (LCP) project targeted.

In a sense, the fact that the scales are couched in quasi-communicative terms is a positive development. The recognition that languages are learned to be used—at least where adults are concerned—often for important, quite specialized purposes, reflects the increased popularity of communicative approaches to language teaching and of the proficiency movement since the 1960s (Higgs, 1984; Musumeci, 2009). To describe points on a proficiency scale in terms of a learner’s ability to produce utterances containing verbs marked for simple past tense, “understand utterances containing the third conditional” (North & Schneider, 1998), and so forth, with no reference to the speech acts, topics, tasks, or situations in which use of the grammatical features occurred, would be even less informative, after all, and would reflect a synthetic, noncommunicative approach to language teaching and testing unsupported by theory and research findings in SLA (Long, 1991, 2007, 2009). Either forms or uses alone are inadequate; what is needed is a combination, with linguistic features linked systematically to development described in terms of some kind(s) of functional abilities.

The goal of the LCP project was to provide empirically based checklists of linguistic correlates of ILR proficiency, and thereby to increase the transparency and utility of ratings and, ultimately, improve the teaching, learning, and testing of LCTLs. The checklists would detail much of the control of linguistic elements needed to move from one level to another on a proficiency scale and would provide syllabus designers, materials writers, teachers, and learners with the information they need. The immediate goal was proof of concept: to determine the feasibility of identifying control of linguistic features, in perception and production, implicated in progress within a relatively narrow proficiency range (ILR 2, 2+, and 3), across groups of learners, and in a demonstration language (Russian). The research would not itself produce diagnostic tests with the psychometric properties required of reliable and valid measures, but it would subsequently be a relatively straightforward undertaking to produce such measures out of the data-collection batteries developed for the study.

When designing the study, the researchers were mindful of the need pointed out by Bachman (1988) to avoid the circularity of defining more proficient speakers as those who can perform more demanding tasks, and defining more demanding tasks as those that can be performed (only) by more proficient speakers. The study was not designed to establish which tasks or task types are better handled by learners of higher or lower global proficiency, nor were learners assessed in terms of their ability to perform well on the tasks per se. What was measured was their control of linguistic features; the tasks were simply carriers. Some of the same tasks (translation, elicited imitation, etc.) were used for different target features and in different linguistic domains (morphology, collocations, etc.). The same learners might exhibit poor control of some features assessed using a translation task, for example, but acceptably high control of other features assessed via the same or another translation task. Thus, the focus throughout was on control of linguistic elements, not success or failure with particular tasks or task types.

It should also be noted that oral proficiency on the ILR scale is assessed via the Oral Proficiency Interview (OPI) globally, without any reference to the accuracy of individual features that occur in a speech sample. Additionally, given the well-attested avoidance strategy in L2 performance, the absence of a feature in an L2 learner’s output may itself be indicative of underlying difficulty, although some features, such as the use of case marking in Russian, are impossible to avoid. In this study, correlates can consequently be interpreted as correlates of proficiency, not aspects of the criteria that proficiency raters are (impressionistically) using.

Previous Research

The problems that have been outlined have inspired at least one other major attempt to deconstruct global proficiency ratings: the project for the development of diagnostic language tests (DIALANG) (Alderson, 2005; Alderson & Huhta, 2005). DIALANG is a free and open (i.e., publicly funded and not secure), Internet-delivered, no-stakes or low-stakes test, of which the volunteer test taker can opt out at any time. DIALANG targets 14 European languages, with results linked to the six levels on the CEFR scale. Crosslinguistic comparability is desirable, but hard to achieve. The test begins by asking learners to assess their own proficiency via a 75-item vocabulary test (50 real words, 25 pseudowords), in which learners are required to identify which are real words and which are not. Either alone or combined with the learner’s (optional) responses to a set of 18 can do statements for each of four language skills, the instantaneously scored and communicated results guide the learner to the appropriate level of the main test—easy, medium, or difficult—which covers listening, reading, writing (indirectly), grammar, and vocabulary. There is no speaking component.

The DIALANG tests are organized in terms of subskills. For example, in the area of reading, one is tested on the ability to distinguish a main idea from supporting detail, the ability to understand text literally, and the ability to make appropriate inferences; all are reported separately, not as an integrated score. Four types of test items are employed: multiple-choice questions, drop-down menus, text entry, and short-answer questions. Test results are delivered to learners in terms of one of the six CEFR proficiency levels, not as test scores, along with differences between their self-assessed rating and their actual test performance if they completed the optional can do and advisory feedback sections. The advice on how to progress to the next CEFR level, again in terms of subskills, takes the form of tables that show the differences between what learners at the test taker’s current level and at the preceding and following levels on the CEFR scale can supposedly do, along with the kind of texts they can handle, and under what conditions, or with what limitations.

Some of DIALANG project director Alderson’s (2005) conclusions are sobering. Of particular importance for the present study is the fact that proficiency-type items (e.g., reading subskills) do not pattern with CEFR level and do not get at what underlies proficiency:

The results of the piloting of the different DIALANG components showed fairly convincingly that the subskills tested do not vary by CEFR level. It is not the case that learners cannot inference at lower levels of proficiency, or that the ability to organize text is associated with high levels of proficiency only. This appeared to be true for all the subskills tested in DIALANG. If subskills are not distinguished by CEFR level, then one must ask whether the development of subskills is relevant to an understanding of how language proficiency develops and hence to diagnosis. . . . The possibility is worth considering that what needs diagnosing is not language use and the underlying subskills, but linguistic features, of the sort measured in the Grammar and Vocabulary sections of DIALANG. (Alderson, 2005, p. 261)


Both the proficiency levels in the CEFR scale and the assignment of items in the diagnostic tests to levels are based on expert judgments of difficulty—that is, the intuitions of experienced teachers and testers of the languages concerned—not on data on the acquisition of those languages. Empirically based acquisition data, in the form of inventories of objectively recognizable linguistic features, are the desirable approach and the one followed in the LCP project. Because relationships between mastery of such features and proficiency levels are unknown, a priori assignment of features or proficiency subskills to levels is unjustified. Charting mastery of sets of features relative to one another is, however, the first order of business, and the relationships of feature mastery to proficiency scale levels are determined by the data, not by intuition or a priori standard settings, even when, as in the DIALANG project, acceptably high interrater agreement was obtained among the judges of item difficulty. To set the standard on the basis of the judgment of item raters, however expert, means that one set of intuitions—the original proficiency scales—is being equated with another set of intuitions, when what is needed is an empirical basis for both. In the LCP project, the purpose of the research is to unpack levels on the ILR scale and to empirically determine the degree of control of linguistic features, linguistic constructions, and the other objectively measurable holistic abilities (e.g., the ability to comprehend speech in noise) that differentiate them. Subsequently, it would also be very useful to know which classes of problems occur across languages at these levels, and which tend to be specific to particular typologically defined groups of languages. Such information would be of considerable help to those involved with languages for which no tests are available at advanced levels (or at all), yet for which rapid ramp-up may be required.

A second research effort of relevance to this work, also targeting the CEFR, is described briefly by Hulstijn and Schoonen (2006):

It must be borne in mind that the scales of the CEFR are to a large extent based on the intuitions of language teachers about descriptors derived from a number of existing widely used scales of language proficiency. What the CEFR does not indicate is whether learner performance at the six functional levels listed in Chapter 4 of the CEFR (Council of Europe, 2001) actually does exhibit the linguistic characteristics listed in chapter 5, and, more specifically, which linguistic features are typical of each of the levels. Answers to these questions are essential if the CEFR is to be successfully implemented across Europe. (p. 7)

This project proposes to focus on seven target languages: English, Finnish, French, Swedish, German, Italian, and Dutch. The aims are:

1. To find out what learners can and cannot do in all four skills (listening, speaking, reading, and writing) at each of the six levels in the CEFR scales, which requires administration of tasks of a closed format (not just “free” speech in an oral interview, for instance) and the measurement of both accuracy and reaction times
2. To develop diagnostic tools
3. To identify commonalities and differences in the linguistic (and paralinguistic) profiles of naturalistic and instructed learners

The clearly stated intention is to focus on empirical studies of learner performance at the various CEFR levels, although results are not yet available. The report’s title, Scientific Report of ESF-Sponsored Exploratory Workshop: Bridging the Gap between Research on Second Language Acquisition and Research on Language Testing, demonstrates precisely the orientation of the LCP project. The most obvious difference is that whereas the CEFR is a relatively new kid on the block, the ILR scale has been used for many years and is a widely accepted proficiency yardstick in most branches of the U.S. government and, through the related ACTFL scale, in academia and elsewhere.

THE LINGUISTIC CORRELATES OF PROFICIENCY PROJECT

In light of these considerations, the LCP project was undertaken using Russian as the demonstration language. The purpose was to identify linguistic features implicated in progress on the ILR scale—specifically, from ILR 2 to 2+, and from 2+ to 3—and the project ultimately aimed to improve the teaching, learning, and testing of high-level listening and speaking abilities. The study focused on the 2–3 range because of a U.S. government initiative that sought to raise the proficiency level—from 2 to 3—of the tens of thousands of employees for whom command of a foreign language is considered critical for job performance. The research was intended to demonstrate proof of concept; that is, that despite idiosyncrasies at the level of individuals, due to the well-documented substantial variability in interlanguages, it is possible to specify the linguistic features implicated at various proficiency levels across groups of learners. If successful, the same methodology could then be applied to other proficiency ranges, as well as to new languages. Given these idiosyncrasies, it was considered useful if the same battery of measures developed to define LCP could also be used to provide feedback to individual learners (and their teachers) on their particular abilities and problems.2

For the purposes of the study, the products of such work, LCP, or linguistic profiles, are detailed inventories of linguistic features—phonology, morphology, syntax, vocabulary, and collocations—in perception and production, and of abilities such as comprehension of speech in noise, that are typically mastered by, or still problematic for, adult English-speaking learners of foreign languages at various separately assessed levels on a proficiency scale.


Although both the DIALANG and LCP projects are diagnostic (see Kunnan & Jang, 2009), they differ in a number of ways. Perhaps the most crucial difference is that whereas DIALANG focused on subskills (e.g., the ability to distinguish a main idea from supporting details), the LCP project focuses on language descriptors. The findings of the LCP project make it possible to follow the development of individual features. For some features tested, an acquisition point was established for a given testing modality (perception or production) using the traditional 80% accuracy threshold in SLA research. Sensitivity to violations in the appropriate use of verbs of motion does not, for example, develop even in ILR 3 learners, whereas sensitivity to violations in derivational morphology (illegal derivations) develops at ILR 2+. Any given ILR proficiency level can be characterized by a certain level of attainment for a set of features. The study’s findings for Russian illustrate the kind of feedback that can be provided to learners and teachers:

1. ILR 2: Can deal with a low noise level with 80% accuracy only in supporting contexts, but not in low-predictability contexts (60% accuracy).
2. ILR 2+: Can conjugate real and nonce regular default -aj- class verbs with 80% accuracy but has difficulty with irregular -a- verbs, with accuracy only at 20%.
3. ILR 3: In subjectless sentences, is sensitive to mismatches in case in impersonal sentences with 80% accuracy but is only sensitive to mismatches in verb agreement with the expletive subject with 50% accuracy.

Several other features differentiate the two projects. Whereas DIALANG focuses on listening, reading, and writing, the LCP project focused on spoken production and perception only. Because this was a feasibility study, and because some end users of other LCTLs—such as Chinese—might not possess or be interested in L2 literacy, no use of the L2 writing system was allowed in the computer-delivered test batteries themselves (although non-Roman alphabets would eventually be usable in feedback to teachers and some individual learners). Whereas DIALANG focuses on 14 languages, some commonly taught, some not, the LCP project targeted three proficiency levels in just one, Russian. The aim was to determine whether fine linguistic distinctions could be made reliably within a relatively narrow proficiency range, as opposed to the entire proficiency range from beginners to near-native targeted in the DIALANG work. Whereas several of the 14 DIALANG languages have substantial corpora, expertise in SLA, and well-documented language teaching and language testing behind them, relevant literature on Russian was comparatively rare. Thus, when generating initial sets of items, the project team was forced to rely on a mix of introspections and retrospections by advanced learners and experienced teachers, on linguistic analyses in books and journals, on the relatively few available SLA studies of the acquisition of Russian, and on inferences drawn from better-studied L2s, like English, German, Spanish, and Japanese, as to what might be worth probing.

There were important methodological differences as well. DIALANG, for example, used expert judgments to assign items to this or that CEFR proficiency level, but the LCP project was performance based; that is, it focused on what learners can or cannot do at a certain ILR proficiency level. This approach is labor intensive, but it avoids the potential circularity of raters classifying linguistic features according to the use they make of them in assigning proficiency ratings when administering the OPI or ACTFL interview. Additionally, DIALANG employs four main types of test items: multiple-choice questions, drop-down menus, text entry, and short-answer questions; the LCP project experimented with more than 20 task types, some familiar in L2 assessment, others drawn from research in SLA involving very advanced learners (e.g., work on the critical period hypothesis) and in experimental L1 psycholinguistics.

THE STUDY

The two most significant challenges to increasing the talent pools in foreign languages, especially LCTLs, are that (a) L2 learners often find functional proficiency at level 2 and above on the ILR scale difficult to achieve, and (b) most teachers, testers, and learners lack anything more than intuitive knowledge of the level of control of the linguistic features—phonology, morphology, syntax, lexis, and collocations—implicated in progress on the ILR scale, particularly at ILR 2, 2+, and 3. The intuitions of experienced professionals are often sound but sometimes are not, and in any case they constitute an inadequate basis for meeting what are often critically important foreign language needs.

The LCP project examined the linguistic features of the interlanguages of English-speaking learners of one LCTL, Russian, in the ILR 2–3 proficiency range. The study was guided by two principal research questions:

1. Which linguistic features of Russian correlate with proficiency levels 2, 2+, and 3 on the ILR scale?
2. At what level of control do these linguistic features correlate with proficiency levels 2, 2+, and 3 on the ILR scale?

Through an instrumentation study conducted in 2007, in addition to testing the viability of a battery of perception and production tasks, the research team had tentatively identified a set of linguistic features of Russian in the phonological, morphological, syntactic, lexical, and collocational domains relevant for progress by learners and heritage speakers in the ILR 1+ to 4 proficiency range. The main study employed a larger subject pool within the narrower ILR 2–3 range, along with a revised version of the data-collection battery (improved through standard item analyses, etc.), which again consisted of a number of tasks that tapped both perception and production abilities. The production involved was very limited and amounted to just that which was required to assess active control of the targeted features. Each of the 33 tasks in the revised battery was designed to probe knowledge of one or more features. Relationships between linguistic features and the ILR proficiency levels of interest, which emerged from the first study, would be confirmed, or not, as the case may be.

METHOD

Instrumentation: Task Types, Tasks, and Items

The design of instruments that measure L2 abilities becomes harder with each proficiency increase, especially when LCTLs are involved. Due to options in topic control and avoidance—more skillfully exploitable by advanced learners—such relatively open-ended procedures as oral interviews and picture descriptions tend to elicit samples of what speakers can do and are less useful for finding out what they cannot do. This means that a satisfactory data-collection battery or diagnostic test, especially one suitable for use with advanced learners, will need to include a number of closed tasks designed to probe for the small, often unnoticed flaws in advanced language proficiency, with avoidance precluded by the nature of the elicitation tasks. Discrete-point tasks of several kinds will be needed to identify gaps in learners’ linguistic repertoires, and to measure accurate and appropriate usage and understanding in various phonological, morphological, syntactic, lexical, and collocational domains. Several task types with proven track records in SLA research (grammaticality judgment, elicited imitation, etc.) were employed again in the LCP research, although a number of task types were imported from other fields, notably first language (L1) research in psycholinguistics, and adapted for use with adult L2 learners where necessary. A full description of, and rationale for, each of the 38 task types trialed in the instrumentation study is beyond the scope of this article but is available in Long et al. (2006).

There were additional restrictions on what was possible with regard to instrumentation. The data-collection batteries created for the LCP research, like any diagnostic tests that might eventually be based on them, were required to be deliverable via computer, so as to facilitate their use by students and language testers in some situations (e.g., distance learning). The batteries could employ written English stimuli (e.g., in a translation task) but had to avoid use of written forms of the target language, given that some learners and test takers at lower proficiency levels might only need, and only possess, functional ability in listening or speaking. The emphasis, therefore, was on combinations of English visual and Russian oral stimuli.

Just as innovations would be required to develop the task types suitable to investigate and assess advanced foreign language proficiency, so would creativity be needed in the selection of the linguistic domains and features to be targeted. The most comprehensive diagnostic studies of advanced proficiency when the project began had dealt with two East Asian languages, Japanese and Korean, with inevitable biases in the choice of linguistic features included and excluded in their assessment. Kanno, Hasegawa, Ikeda, Ito, and Long (2007) and Lee, Kim, Kong, Hong, and Long (2005) had tested performance on honorifics and mimetics, for example, features absent in Indo-European languages.3 There is a need for both general and language-specific approaches in the search for LCP levels. A major goal of the LCP project was to develop instruments that included both types of features and were broad in spectrum and flexible in their application to individual languages.

The global nature of ILR OPI testing precludes the use of any nonfunctional descriptors, and consequently, no lists of language-specific grammatical difficulties that correspond to the ILR levels are available. A potential option, a so-called noncompensatory core of grammatical features (e.g., basic use of inflections in cases, verbal conjugation, and verbal aspect), is not employed in ILR testing and so could not be used either. Several alternative sources of ideas were consequently tapped when items for the pilot version of the data-collection battery for the project were developed. They included the reflections of experienced teachers of Russian; introspections (obtained through interviews) of high-achieving learners of the language concerning residual linguistic difficulties; Russian corpora; Russian textbooks for advanced learners; extensive consultations with Russian language instructors in the academic and government sectors; research findings on late-acquired items in L1 acquisition and late-acquired or rarely acquired features in adult L2 acquisition; and, because of its frequent focus on very advanced learners, the literature on critical periods in SLA, although most of that work concerns English. The features selected for testing were not linked explicitly to descriptors of the ILR scale, of course, because prior to this study, no such linkage had ever been established.

The pilot version of the Russian data-collection battery consisted of 38 tasks—phoneme production, AXB discrimination, accent detection, speech in noise, grammaticality judgment, lexical decision with and without priming, multiword unit completion, and so forth—with between 10 and 400, although usually 20–50, items per task. Considerable care was taken to control such potentially confounding variables as utterance length, lexical frequency, and structural complexity.4 The battery consisted of four blocks, each of which was approximately 1 hr in length and contained speech perception or production tasks in the following sequence: listening, speaking, listening, and speaking. The four blocks were separated by pauses that allowed subjects to take short breaks. Four different versions of the test battery were created to balance different conditions, which varied from task to task, and the oral input of the whole battery included alternating male and female voices.

The 2006 instrumentation study obtained data from a total of 70 participants: 36 learners of Russian and 24 heritage speakers—with ratings that ranged from 1 to 4+ on the ILR scale (39 in the 2–3 range)—as well as 10 L1 Russian speaker controls. Participants were recruited through newspaper advertisements and by word of mouth. All were college educated and ranged in age from 18 to 56. Standard item analyses and additional statistical analyses (Cronbach’s alphas and ANOVAs) showed that most tasks had acceptably high internal reliability, that students at different levels on the ILR proficiency scale differed in their scores on most tasks, and that significant effects for ILR level were obtained on some tasks in the lower proficiency ranges, some in the higher ranges, and some across the entire range.5 These results had been obtained despite the fact that the data were collected (a) from a truncated sample in terms of proficiency (most subjects were in the ILR 2–4 range), thereby reducing the scope of variation possible and the likelihood of obtaining statistically significant findings; (b) sometimes with fewer items than could easily be included in the future if a task were retained (e.g., because of the rigorous native speaker acceptance criterion for item retention—no more than two L1 speakers could disagree on an item for it to be retained—a number of items in some tasks had to be discarded); and (c) with data missing as a result of an occasional technical failure in the recording process.

The main study employed a revised and rerecorded version of the data-collection battery and was conducted in 2008. Of the original 38 tasks, 30—improved as a result of the item analyses or by creating additional items—were retained. Three additional elicited imitation (EI) tasks were added, for a total of 33. Each of the 33 tasks tested one feature (often with several subfeatures), and most task types were used to test a variety of linguistic features.6

Participants and Procedures

Within the narrower ILR 2–3 proficiency range, a larger subject pool was employed than in the instrumentation study. Data were collected from a total of 68 participants: 57 learners (21 at ILR 2, 18 at 2+, and 18 at 3), all of whom were also tested for oral proficiency via an ILR oral proficiency telephone interview conducted by certified testers, and 11 L1 speaker controls (see Table 1). For all the analyses, only the learners’ data are presented, because the L1 speakers were relevant only as controls to check that the tests operated as expected. For the linear discriminant analyses (LDAs), one ILR 2 participant (female) was dropped from the analysis for missing several of the tests due to technical problems. Potential participants were recruited via newspaper advertisements, through word of mouth, and from among faculty and students in Russian programs at two American universities. They were then screened into the study on the basis of their performance in an OPI telephone interview by a certified tester. The data-collection battery, programmed using the DMDX software package, was administered via computer in four 1-hr blocks, with short breaks between blocks.7

ANALYSES

Error rates were analyzed using generalized linear mixed models with a logit link function (henceforth logit mixed models). These models allow for important random effects, such as subject and item, to be modeled, unlike ordinary logistic regression, and they are better suited to analyze categorical data, such as error rates, than repeated-measures ANOVA (see Agresti, 2002, and Jaeger, 2008, for discussion and further references). Models were fitted in the statistical platform R (R Development Core Team, 2008), using the lmer function from the lme4 library (Bates & Sarkar, 2007). Model selection and testing were performed such that ILR level was always retained in the model (to test the hypothesized effects) and was treated as a categorical, ordinal variable, using different contrast codings to test the three comparisons shown in the summary chart (ILR 2 vs. 2+, 2+ vs. 3, and 2 vs. 3).
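To make the modeling concrete, the following is a minimal sketch in R of the kind of model just described. It is not the authors’ actual code: the data frame and variable names (task_data, accuracy, ilr, subject, item) are hypothetical, and in current versions of lme4 a binomial model is fitted with glmer() rather than the lmer() call named in the text.

# Hypothetical per-trial data: one row per subject-item pair, with a binary
# accuracy outcome and the subject's ILR rating as a factor.
library(lme4)

task_data$ilr <- factor(task_data$ilr, levels = c("2", "2+", "3"))
# Different contrast codings can be assigned here to test each pairwise
# comparison (2 vs. 2+, 2+ vs. 3, and 2 vs. 3); treatment coding with
# level 2 as the baseline is shown.
contrasts(task_data$ilr) <- contr.treatment(3)

# Logit mixed model: fixed effect of ILR level, random intercepts for
# subject and item.
m1 <- glmer(accuracy ~ ilr + (1 | subject) + (1 | item),
            data = task_data, family = binomial(link = "logit"))
summary(m1)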

Table 1. Russian participant distribution by age, gender, and proficiency

OPI score     Female participants   Min. age   Max. age   Male participants   Min. age   Max. age   Total
2             4                     23         30         17                  21         65         21
2+            6                     26         36         12                  21         60         18
3             4                     31         46         14                  22         59         18
L1            8                     22         30         3                   24         28         11
Grand total   22                                          46                                        68

Note. OPI = Oral Proficiency Interview.


Because many tasks included one or more fine-grained factors in the design, model comparison was used to trim unnecessary predictor variables. Model comparison was also used to evaluate the inclusion of random effects of subject and item on both the intercept and the slope of the other variables in the model.8 Following Pinheiro and Bates (2000), models were compared using chi-square likelihood ratio tests (a complete model of the results is available on request from the author), and predictors were only retained when these tests reached significance at p < .05. The result was that many of the fine-grained distinctions designed into the tests did not reach significance and thus were not part of the final models. For each of the final models, ILR rating was used to predict accuracy in order to test whether there is some relationship between general proficiency and test performance.
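In R, such a comparison of nested models reduces to a few lines; the sketch below reuses the hypothetical model m1 from the earlier sketch, and anova() on two nested lme4 fits performs the chi-square likelihood ratio test.

# Null model without the predictor under test, but with the same
# random-effects structure.
m0 <- update(m1, . ~ . - ilr)

# Chi-square likelihood ratio test; the predictor is retained only if the
# test reaches significance at p < .05.
anova(m0, m1)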

LDAs were carried out as a complement to the logit mixed model analyses, in order to assess how successfully these measures correctly classify participants into the ILR levels determined by the OPI. In general, the tests’ accuracy scores were moderately to highly correlated with one another; thus, entering large numbers of tests as predictors would run afoul of multicollinearity problems. Out of the 465 unique pairwise correlations between measures, 164 were correlated at r > 0.50. The pairwise correlations of more than 0.70 are listed in Table 2.

Because of these relatively high levels of correlation between tests, LDAs were built starting with single predictors, rather than with entire sets of tests, and predictors were added only when classification accuracy improved. Three separate LDAs were performed, for three different binary contrasts: classification of ILR 2 versus ILR 3, ILR 2 versus ILR 2+, and ILR 2+ versus ILR 3. Finally, in order to assess internal reliability, Cronbach’s alpha was computed for each test.
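A minimal sketch of one such single-predictor analysis in R follows, under assumed names: scores is a data frame of per-subject overall accuracy scores (one column per test, here a hypothetical test16_numerals) for the subjects in one binary contrast, and ilr holds their OPI ratings. MASS::lda() with CV = TRUE returns leave-one-out cross-validated classifications, from which classification accuracy can be computed directly.

library(MASS)

# Single-predictor LDA with leave-one-out cross-validation (CV = TRUE).
fit <- lda(scores[, "test16_numerals", drop = FALSE],
           grouping = ilr, CV = TRUE)

# Classification accuracy: correct classifications / number of subjects.
mean(fit$class == ilr)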

RESULTS

Overall, the results indicated that every test except for one—picture-word elicitation, a phonology production test—showed significant differences between ILR levels 2 and 3 at p < .05. Out of 33 tests, 14 showed significant differences between ILR levels 2 and 2+, and 21 showed significant differences between ILR levels 2+ and 3. The perception tests that showed significant effects are summarized in Table 3, and the production tests are summarized in Table 4. Complete tables of detailed results for all the tests are available on request from the author.

The values of Cronbach’s alpha for the majority of the production tasks are in the high range (0.711–0.978, M = 0.873, Mdn = 0.899), which indicates high internal consistency. Cronbach’s alpha values are typically lower for the perception tasks (0.261–0.925, M = 0.693, Mdn = 0.728) and for tasks with smaller numbers of items.


Table 2. Pairs of tests correlated at r = 0.70 or greater

Pairs of correlated tasks                                                                                    Pearson r
31: Participles (elicited imitation) / 32: Verbal adverbs (elicited imitation)                               0.835
10: Phonological priming (lexical decision) / 2: Derivational morphology (lexical decision)                  0.818
20: Aspect (sentence completion) / 25: Verbal adverbs (English to Russian translation)                       0.773
32: Verbal adverbs (elicited imitation) / 33: Subordinate clauses (elicited imitation)                       0.734
10: Phonological priming (lexical decision) / 22: Verbs of motion (restricted control)                       0.734
20: Aspect (sentence completion) / 28: Verbs of motion (sentence completion)                                 0.731
31: Participles (elicited imitation) / 22: Verbs of motion (restricted control)                              0.725
31: Participles (elicited imitation) / 33: Subordinate clauses (elicited imitation)                          0.717
26: Idiom correction / 28: Verbs of motion (sentence completion)                                             0.715
23: Plural nouns (English to Russian translation) / 24: Numerals (English to Russian translation)            0.707
27: Sentence collocations (English to Russian translation) / 28: Verbs of motion (sentence completion)       0.707
20: Aspect (sentence completion) / 7: Verbal prefixes (multiple choice)                                      0.707
25: Verbal adverbs (English to Russian translation) / 7: Verbal prefixes (multiple choice)                   0.703
27: Sentence collocations (English to Russian translation) / 2: Derivational morphology (lexical decision)   0.700


Production tasks tend to differentiate one more sublevel contrast, in addition to levels 2 and 3, whereas some of the perception tasks do not differentiate, or only marginally differentiate, sublevels within the 2–3 range. These results, taken together, suggest that the production data are more efficient and reliable. However, given that production data need a human scorer and are therefore more resource consuming, a trade-off between higher reliability and lower cost-effectiveness should be considered in the future development of similar diagnostic tasks.
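For reference, the internal-consistency index reported here can be computed for a single test in a few lines of base R. This is a generic sketch of the standard Cronbach’s alpha formula, not the authors’ scoring code, and the 0/1 item-score layout is an assumption.

# items: a matrix with rows = subjects, columns = items, entries = 0/1
# accuracy scores (hypothetical layout).
# alpha = k/(k - 1) * (1 - sum of item variances / variance of total score)
cronbach_alpha <- function(items) {
  k <- ncol(items)
  item_vars <- apply(items, 2, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}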

A comparison of accuracy scores in the perception and production tasks confirms the well-documented observation that accuracy in production experiments typically lags behind accuracy in perception. Additionally, the fact that some tasks were used with different features makes it possible to directly compare learners’ control of the features in question. For example, in three EI tasks, participles, verbal adverbs, and different types of subordinate clauses were targeted within exactly the same testing format. Sentence length and target structure position—participle, verbal adverb, and conjunction—were strictly controlled, and vocabulary from the same frequency ranges was used. The accuracy scores show that participles and verbal adverbs were considerably more difficult for speakers in the ILR 2–3 proficiency range than subordinate clauses. Comparison of the different subordinate clause types within the same EI task shows that, at the same time, the unreal condition presented more difficulties for level 2 and 2+ learners than the other types. Finally, of all the domains included in the battery, accuracy scores on tasks that tested control of some kinds of idiomatic language (e.g., collocations, idioms, proverbs, and sayings) were the lowest. This finding supports the general notion that idiomatic use remains a learning objective for highly proficient L2 learners at ILR level 3 and above.

Table 3. Perception tests that demonstrated significant effects of ILR level on accuracy at p < .05

Between ILR 2 and 2+:
  Lexical decision: verbal adverbs
  Primed lexical decision: morphology

Between ILR 2+ and 3:
  Primed lexical decision: phonology
  Lexical decision: derivational morphology
  Primed lexical decision: semantics
  Nonimperative imperative constructions
  Participles: telicity
  Subjectless sentence constructions

Between both ILR 2 and 2+, and 2+ and 3:
  Approximate vs. exact numerals
  Reflexive verbs
  Verbal prefixes
  Collocations
  Basic lexicon: nouns
  Indefinite nouns


Even though the individual test results of the logit mixed models are informative, LDAs provide estimates of how effectively the tests classify subjects into appropriate ILR levels. In these analyses, only overall accuracy within each test was analyzed; the fine-grained conditions were collapsed. Single-predictor LDAs for each of the 33 tests were used to calculate base rates of classification accuracy, defined as the number of correct classifications divided by the number of subjects. For all of the analyses, the reported classifications are based on leave-one-out cross-validation. The result is a more conservative, robust estimate of the classification accuracy of the model. The tests that resulted in a classification of 0.7 or better on their own in cross-validation are listed in Tables 5–7, for each of the ILR contrasts.

Table 4. Production tests that demonstrated significant effects of ILR level on accuracy at p < .05 (significant differences between ILR 2 and 2+, between ILR 2+ and 3, or between both contrasts)

Paradigm elicitation: verbs
Sentence completion: aspect
Sentence completion: numerals
Speech perception in noise
Sentence completion: verbs of motion
L1 to L2 translation: singular vs. plural nouns in sentences
Restricted control: verbs of motion
Idiom correction
Elicited imitation: participles
Elicited imitation: subordinate clauses
L1 to L2 translation: numerals in sentences
L1 to L2 translation: verbal adverbs
L2 to L1 translation: collocations

Multivariate LDAs were constructed by starting with the predictor with the best classification accuracy on its own (i.e., the top entry in each of Tables 5–7, respectively) and then adding further predictors one at a time; a predictor was retained in the model only if it improved the classification accuracy of the cross-validated LDA. Three models ultimately emerged as among the best possible classification models for this set of data. Note that these are not the three uniquely best models; it is possible that some tests could be substituted for others and still achieve similar classification results. These models simply illustrate the most one could reasonably expect from this battery of tests in terms of overall classification accuracy. The models for the three ILR contrasts are given, with the classification tables for each contrast, in Tables 8–10.
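The stepwise construction just described amounts to greedy forward selection over cross-validated accuracy. A minimal sketch, reusing the hypothetical loocv_accuracy() helper from the earlier listing rather than reconstructing the authors' exact code:

    # Start from the best single predictor, then try the remaining tests
    # one at a time, retaining a test only if it improves the
    # cross-validated classification accuracy.
    forward_select <- function(test_cols, data) {
      base_rates <- sapply(test_cols, loocv_accuracy, data = data)
      kept <- names(which.max(base_rates))
      best <- max(base_rates)
      for (candidate in setdiff(test_cols, kept)) {
        acc <- loocv_accuracy(c(kept, candidate), data)
        if (acc > best) {
          kept <- c(kept, candidate)
          best <- acc
        }
      }
      list(predictors = kept, cv_accuracy = best)
    }

    # One model per ILR contrast, e.g., for 2 vs. 2+:
    forward_select(paste0("test", 1:33), subjects_2_vs_2plus)

Because the selection is greedy, the result depends on the order in which candidates are tried, which is one reason the three models reported here are not uniquely best.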

These results show that even with a small fraction of the measures tested in this project, classification rates are quite high. The especially high accuracy of classification between ILR 2 and 3 is somewhat expected, given that this contrast represents a very substantial difference in overall proficiency. More interesting is that a small number of measures were also able to classify ILR 2 versus 2+ at 76%, and ILR 2+ versus 3 at better than 80%. This distinction is especially slippery in the ILR system, and 2+ is often characterized as being almost 3 rather than as representing some well-defined position between 2 and 3, so the level of success in classification by these measures is interesting and somewhat surprising. Additionally, it is interesting that the set of measures that discriminates best between 2 and 2+ is distinct from the set that discriminates 2+ and 3. These results, at this level of granularity, make a strong case for the viability of this kind of research program, designed to flesh out the fine linguistic details of progression through a proficiency scale such as the ILR, and for the potential of such measures to augment current testing paradigms.

Table 5. Tests that classified ILR 2 vs. ILR 3 at better than 0.70 accuracy (cross-validated classification accuracy shown for each test)

Test 16: approximate vs. exact numerals  0.87
Test 3: nonimperative uses of imperative constructions  0.84
Test 20: aspect (sentence completion)  0.84
Test 31: participles (elicited imitation)  0.84
Test 22: verbs of motion (restricted control)  0.82
Test 33: subordinate clauses (elicited imitation)  0.79
Test 6: collocations/proverbs  0.79
Test 2: derivational morphology (lexical decision)  0.76
Test 5: basic lexicon (verbs)  0.76
Test 7: verbal prefixes  0.76
Test 24: numerals (English to Russian translation)  0.76
Test 25: verbal adverbs (English to Russian translation)  0.76
Test 28: verbs of motion (sentence completion)  0.76
Test 32: verbal adverbs (elicited imitation)  0.76
Test 4: basic lexicon (nouns)  0.74
Test 10: phonological priming (lexical decision)  0.74
Test 18: reflexives (grammaticality judgment)  0.74
Test 27: sentence collocations (English to Russian translation)  0.74
Test 8: limited pro-drop (grammaticality judgment)  0.71
Test 13: phonemic perception (hard/soft and voicing)  0.71
Test 17: verbal adverbs  0.71
Test 19: indefinite pronouns (grammaticality judgment)  0.71
Test 34: perception of speech in noise  0.71

Note. ILR = Interagency Language Roundtable.

Despite the robustness of these findings under a cross-validated analysis, it should be noted that, given the relatively small sample size and the possibility that the large selection of measures allowed for capitalization on variance unique to this sample, these results should not be overinterpreted. There is clearly strong potential to add valuable information to what is known about learners' abilities at different ILR levels, and it is clear that even a relatively small number of such measures can effectively discriminate learner groups. To be more confident that the particular measures that were successful in this study represent the most stable and reliable discriminators, at the linguistic construct level or at the task-type level, further replication and validation are needed.

Table 6. Tests that classified ILR 2 vs. ILR 2+ at better than 0.70 accuracy (cross-validated classification accuracy shown for each test)

Test 6: collocations/proverbs  0.74
Test 4: basic lexicon (nouns)  0.71
Test 18: reflexives (grammaticality judgment)  0.71

Table 7. Tests that classified ILR 2+ vs. ILR 3 at better than 0.70 accuracy (cross-validated classification accuracy shown for each test)

Test 31: participles (elicited imitation)  0.86
Test 28: verbs of motion (sentence completion)  0.81
Test 10: phonological priming (lexical decision)  0.75
Test 24: numerals (English to Russian translation)  0.75
Test 25: verbal adverbs (English to Russian translation)  0.75
Test 6: collocations/proverbs  0.72
Test 9: verbs of motion (grammaticality judgment)  0.72
Test 30: phonology production ("hushers" and "y" vowel)  0.72
Test 33: subordinate clauses (elicited imitation)  0.72

DISCUSSION AND CONCLUSIONS

Having been refined by item analyses and other operations since the first round of data collection and analysis, 32 of the 33 tasks in the data-collection battery (18 in perception and 14 in production) significantly differentiate ILR proficiency levels 2 and 3. This is the case despite the truncated global proficiency range in focus, and it suggests that the data-collection battery is sensitive and psychometrically sound. For perception, 9 tasks also differentiate levels 2 and 2+, and 7 differentiate 2+ and 3. In production, 3 tasks also differentiate levels 2 and 2+, and 13 differentiate levels 2+ and 3.

Most learners performed better on almost all perception tasks than they did on production tasks. Perception exceeds production at the levels studied, advanced and beyond, reflecting the general tendency for production to lag behind perception in both L1 and L2 acquisition. For some participants, this tendency was reinforced by the way they had learned Russian and by the tasks for which most use the language in their work. This is true in all domains: phonological, morphological, syntactic, lexical, and collocational. Although steady improvement has been shown from ILR 2 to 2+, and from 2+ to 3, level 3 learners still have difficulty with many of the linguistic features assessed in the study.

Table 8. Number of actual vs. predicted ILR levels for the best LDA model predicting ILR 2 vs. 3, with the predictors of tests 6, 7, and 31 (collocations/proverbs, verbal prefixes, participles)

Actual ILR 2: predicted as 2 = 20, predicted as 3 = 0
Actual ILR 3: predicted as 2 = 3, predicted as 3 = 15
Cross-validated classification accuracy: 0.92

Table 9. Number of actual vs. predicted ILR levels for the best LDA model predicting ILR 2 vs. 2+, with the predictors of tests 6 and 18 (collocations/proverbs, reflexives)

Actual ILR 2: predicted as 2 = 16, predicted as 2+ = 4
Actual ILR 2+: predicted as 2 = 5, predicted as 2+ = 13
Cross-validated classification accuracy: 0.76
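The classification tables follow directly from the cross-validated predictions: cross-tabulating actual against predicted levels gives the 2 x 2 counts, and the accuracy is the proportion on the diagonal. A worked check in R against Table 8's counts (the vectors here are reconstructed by hand purely for illustration):

    # With lda(..., CV = TRUE), the equivalent would be
    #   table(data$ilr, fit$class); mean(data$ilr == fit$class)
    actual    <- factor(c(rep("2", 20), rep("3", 18)), levels = c("2", "3"))
    predicted <- factor(c(rep("2", 20), rep("2", 3), rep("3", 15)),
                        levels = c("2", "3"))
    table(actual, predicted)   # first row 20 and 0; second row 3 and 15
    mean(actual == predicted)  # 35/38 = 0.92, the reported accuracy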

Additionally, this study supports the idea that acquisition of the features investigated here is a gradual process relative to advancement through proficiency levels, and that particular features do not belong to particular ILR levels. Figures 1 and 2 show the distributions of scores across ILR levels for a selection of perception and production tests, respectively. The histograms show rates of correct responses (expressed as log odds) across participants. Improvements in performance are seen as the distribution shifts rightward, which indicates that more people are achieving higher odds of correct responses. The tasks shown are representative of the largest effect sizes between proficiency levels, across a mix of domains (morphology, syntax, and lexicon). These distributions are somewhat sparse, given the 18-20 participants in each of the ILR levels. It is clear from these figures that in the majority of tasks, there is not an abrupt shift in performance between proficiency levels but rather a more gradual shift in the distribution of scores across the range. Even though development appears gradual, the LDAs demonstrate that changes in performance can still act as effective discriminators. These results are inconsistent with the idea that a given linguistic feature belongs to a particular level on a proficiency scale, but they confirm the general claim that relative performance on targeted tasks of the type employed in this study tracks very well with progression across ILR proficiency levels.
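The transformation behind Figures 1 and 2 can be reproduced along the following lines in R; the data frame and column names are hypothetical, and a participant with a perfect score would need a smoothing correction, since the log odds of a proportion of 1 is infinite.

    # d: one row per participant for a given task, with prop_correct (the
    # proportion of correct responses) and ilr (the participant's level).
    d$log_odds <- qlogis(d$prop_correct)  # log odds = log(p / (1 - p))

    par(mfrow = c(1, 3))  # one panel per ILR level, as in Figures 1 and 2
    for (lvl in c("2", "2+", "3")) {
      hist(d$log_odds[d$ilr == lvl],
           main = paste("ILR", lvl),
           xlab = "Log odds of a correct response")
    }
    # A rightward shift of the distribution from panel to panel reflects
    # gradual improvement rather than an abrupt, level-defining jump.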

Table 10. Number of actual vs. predicted ILR levels for the best LDA model predicting ILR 2+ vs. 3, with the predictors of tests 3 and 31 (nonimperative imperative constructions, participles)

Actual ILR 2+: predicted as 2+ = 16, predicted as 3 = 2
Actual ILR 3: predicted as 2+ = 5, predicted as 3 = 13
Cross-validated classification accuracy: 0.81

On the basis of the results of the study, checklists of linguistic features for L2 learners that match typical levels of control of those features, measured as percentage accuracy, to proficiency levels on the ILR scale are available for the first time. That is, for the wide range of features investigated in this study, it is now known which are controlled by L2 Russian learners, and to what degree, as learners advance from level 2 to 2+, and from 2+ to 3, on the ILR scale. It has also been demonstrated that changes in performance on many of these measures correlate sufficiently with the construct of general (speaking) proficiency that they can act as effective predictors of proficiency level.

The checklists, in addition to helping learners (both in groups and as individuals), will be useful to syllabus designers, materials writers, classroom teachers, and, eventually, testers in language programs. To the best of the authors' knowledge, this is the first time such information has been established for any language, including ESL, let alone at relatively advanced proficiency levels in a LCTL. A note of caution, however, is appropriate here. Examining language-specific features alongside a scale that is intentionally language generic is a double-edged sword. On the one hand, it provides much-needed detail to otherwise uninformative statements about difficult or complex structures; on the other hand, it introduces the possibility that the results will be irrelevant for other languages. Extending this research into other languages and pursuing crosslinguistic comparison and generalization would be of obvious value; nevertheless, language-specific features constitute a starting point for such a program.

Figure 1. Histograms of proportion of correct responses (expressed as log odds) by ILR level, for selected perception tests.

Figure 2. Histograms of proportion of correct responses (expressed as log odds) by ILR level, for selected production tests.

Additionally, if the kind of information identified by this study could be provided as part of feedback on a diagnostic test, it could potentially be provided to the learners themselves. To illustrate with ESL, hypothetical examples of typical abilities and errors represent what the individual learner can do:

1. ILR 2: Can use subject, direct object, and indirect object relative clauses with 90% accuracy but has difficulty with 50% of object of preposition relative clauses.

2. ILR 2+: Can deal with high noise level only in supporting contexts, at 75% accuracy.

3. ILR 3: Inverts subject and verb correctly 80% of the time in utterances beginning with simple adverbials expressing negative polarity (e.g., Never before had she heard someone with an accent like that).

4. ILR 3+: Inverts subject and verb correctly 100% of the time in utterances beginning with simple adverbials expressing negative polarity, and 70% of the time following complex adverbials (e.g., Only after checking with his supervisor did he agree to the request).

The ability to obtain such results for Russian, plus the experience obtained in the process about the optimal procedures and methodology (especially instrumentation) for doing so, suggests that similar information could now be secured fairly quickly for a broader range of proficiency levels, with any of the commonly used scales, and in a variety of other languages of interest in the occupational and vocational sectors, as well as in academia. Such information would render the proficiency scales in question both considerably more informative and informative to a wider range of end users.

(Received 8 November 2010)

NOTES

1. All italicized words or phrases indicate the authors' emphasis and are used to mark ambiguous, opaque, or otherwise problematic elements.

2. Pragmatic abilities might also be included in the definition of LCPs. They were ignored in the present study for a variety of reasons, not least the labor and costs involved in developing acceptable measures of pragmatic abilities in a new language for which adequate descriptions of pragmatic abilities were lacking.

3. Research by Kanno et al. (2007) and Lee et al. (2005), respectively, found the profiles of heritage and nonheritage learners of Japanese and Korean at the ILR 2-3 level to differ significantly (e.g., in the lexical domain, in the proportion of appropriately used kango), and also to differ within heritage groups according to such factors as whether or not they had attended Mombusho-supported schools in Honolulu or summer programs in Korea, perhaps reflecting a focus on form in the elaborated code of schooling not experienced by pure "kitchen-Japanese/Korean" heritage learners.

4. Lexical frequency was established on the basis of Sharoff's corpus (Russian Internet Corpus, approximately 90 million words at the time of use), which can be found at http://corpus.leeds.ac.uk/ruscorpora.html.

5. The full list of tasks in the instrumentation study, the sequence of presentation, the approximate duration of each task, and the results for each are provided in Appendix R, tables R12, R13, and R14, of the technical report (Long et al., 2006).

6. A complete list of the tasks, along with sample test items, is available on request from the second author at [email protected].

7. As with the original instrumentation study, the test batteries were presented using the DMDX software package (Forster & Forster, 2003). This software was chosen for the LCP project for several reasons. First, it is extremely powerful and flexible, allowing developers to present a wide variety of stimuli (including pictures and graphics, text, and sound files) and to gather a wide variety of data types (including recorded speech, button-press responses from various input devices, and reaction times). The software was developed with particular emphasis on gathering precise reaction times, which is critical in tasks such as lexical decision. Many of the tasks used an ergonomic game controller (Logitech® Precision™ Gamepad), as such controllers are known to gather relatively precise reaction time information and are very economical. The second reason for the choice of DMDX was that the software is free, providing an economical solution to the needs of the project. It was considered that if the research eventually warranted wide distribution of the test batteries, the use of free software would make distribution much easier by reducing the economic burden on end users.

8. As expected given the repeated-measures crossed designs of most of the tasks, random effects of both subject and item on the model intercept nearly always contributed significantly to the models, but random effects on slopes contributed only very rarely.
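In the current lme4 interface, the random-effects structure described in this note corresponds to a model along the following lines. This is a sketch only: the original analyses used the much earlier lme4 version cited below, and the data frame and column names are hypothetical.

    library(lme4)

    # Logit mixed model of trial-level accuracy: a fixed effect of ILR
    # level plus random intercepts for subject and item; random slopes,
    # which rarely contributed, are omitted here.
    m <- glmer(correct ~ ilr_level + (1 | subject) + (1 | item),
               data = trials, family = binomial)
    summary(m)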

REFERENCES

Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. New York: Continuum.
Alderson, J. C. (2007). The CEFR and the need for more research. Modern Language Journal, 91, 659–663.
Alderson, J. C., & Huhta, A. (2005). The development of a suite of computer-based diagnostic tests based on the Common European Framework. Language Testing, 22, 301–320.
American Council for the Teaching of Foreign Languages. (1985). ACTFL proficiency guidelines (Rev. ed.). Hastings-on-Hudson, NY: ACTFL Materials Center.
Bachman, L. F. (1988). Problems in examining the validity of the ACTFL Oral Proficiency Interview. Studies in Second Language Acquisition, 10, 149–164.
Bates, D. M., & Sarkar, D. (2007). lme4: Linear mixed-effects models using S4 classes (R package version 0.999375-28) [Computer software].
Brecht, R., & Rivers, W. (2000). Language and national security for the 21st century: The role of Title VI/Fulbright Hays in supporting national language capacity. Dubuque, IA: Kendall/Hunt.
Brecht, R., & Rivers, W. (2005). Language needs analysis at the societal level. In M. H. Long (Ed.), Second language needs analysis (pp. 79–104). New York: Cambridge University Press.
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, and assessment. New York: Cambridge University Press.
Council of Europe. (n.d.). Language policy. Retrieved December 2, 2006, from http://www.coe.int/lang.
Forster, K. I., & Forster, J. C. (2003). DMDX: A Windows display program with millisecond accuracy. Behavior Research Methods, Instruments, & Computers, 35, 116–124.
Fulcher, G. (1996). Invalidating validity claims for the ACTFL oral rating scale. System, 24, 163–172.
Fulcher, G. (2004). Deluded by artifices? The Common European Framework and harmonization. Language Assessment Quarterly, 1, 253–266.
Higgs, T. V. (1984). Teaching for proficiency: The organizing principle. Lincolnwood, IL: National Textbook.
Hulstijn, J. H. (2007). The shaky ground beneath the CEFR: Quantitative and qualitative dimensions of language proficiency. Modern Language Journal, 91, 663–667.
Hulstijn, J. H., & Schoonen, R. (2006, February). Scientific report of ESF-sponsored Exploratory Workshop: Bridging the gap between research on second language acquisition and research on language testing (European Science Foundation Report No. EW05-208-SCH). Retrieved September 9, 2007, from http://www.esf.org/index.php?eID=tx_nawsecuredl&u=0&file=fileadmin/be_user/ew_docs/05-208_Report.pdf&t=1313602831&hash=1257bb8f0d57cdeec5c605b8d61527ae.
Interagency Language Roundtable. (n.d.). ILR speaking skill scale. Retrieved February 25, 2010, from http://www.govtilr.org/Skills/ILRscale2.htm.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59, 434–446.
Kanno, K., Hasegawa, T., Ikeda, K., Ito, Y., & Long, M. H. (2007). Relationships between prior language-learning experience and variation in the linguistic profiles of advanced English-speaking learners of Japanese. In D. Brinton & O. Kagan (Eds.), Heritage language: A new field emerging (pp. 165–180). Mahwah, NJ: Erlbaum.
Kunnan, A. J., & Jang, E. E. (2009). Diagnostic feedback in language assessment. In M. H. Long & C. J. Doughty (Eds.), Handbook of second and foreign language teaching (pp. 610–627). Oxford: Blackwell.
Lantolf, J. P., & Frawley, W. (1985). Oral proficiency testing: A critical analysis. Modern Language Journal, 69, 337–345.
Lantolf, J. P., & Frawley, W. (1988). Proficiency: Understanding the construct. Studies in Second Language Acquisition, 10, 181–195.
Lantolf, J. P., & Frawley, W. (1992). Rejecting the OPI—again: A response to Hagen. ADFL Bulletin, 23, 34–37.
Lee, Y.-G., Kim, H.-S. H., Kong, D.-K., Hong, J.-M., & Long, M. H. (2005). Variation in the linguistic profiles of advanced English-speaking learners of Korean. Language Research, 41, 437–456.
Lett, J. A. (2005). Foreign language needs assessment in the US military. In M. H. Long (Ed.), Second language needs analysis (pp. 105–124). New York: Cambridge University Press.
Long, M. H. (1991). Focus on form: A design feature in language teaching methodology. In K. de Bot, R. B. Ginsberg, & C. Kramsch (Eds.), Foreign language research in cross-cultural perspective (pp. 39–52). Amsterdam: Benjamins.
Long, M. H. (2007). Problems in SLA. Mahwah, NJ: Erlbaum.
Long, M. H. (2009). Methodological principles for language teaching. In M. H. Long & C. J. Doughty (Eds.), Handbook of language teaching (pp. 373–394). Oxford: Blackwell.
Long, M. H., Jackson, S., Aquil, R., Cagri, I., Gor, K., & Lee, S.-Y. (2006). Linguistic correlates of proficiency: Rationale, methodology, and content (Technical report). College Park: University of Maryland.
Musumeci, D. (2009). History of language teaching. In M. H. Long & C. J. Doughty (Eds.), Handbook of language teaching (pp. 373–394). Oxford: Blackwell.
North, B. (2000). Linking language assessments: An example in a low stakes context. System, 28, 555–577.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15, 217–263.
Pawlikowska-Smith, G. (2000). Canadian language benchmarks 2000: English as a second language for adults. Ottawa: Citizenship and Immigration Canada.
Pienemann, M. (1985). Learnability and syllabus construction. In K. Hyltenstam & M. Pienemann (Eds.), Modelling and assessing second language acquisition (pp. 23–75). Bristol, UK: Multilingual Matters.
Pienemann, M., Johnston, M., & Brindley, G. (1988). Constructing an acquisition-based procedure for second language assessment. Studies in Second Language Acquisition, 10, 217–243.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-PLUS. New York: Springer Verlag.
R Development Core Team. (2008). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, URL http://www.R-project.org.
Wylie, E., & Ingram, D. E. (1999). International Second Language Proficiency Ratings (ISLPR): General proficiency version for English (Rev. ed.). Brisbane: Center for Applied Linguistics and Languages, Griffith University.