nikolay gochev

Criterion-related Validity of Strong and Weak Second Language

Performance Assessments in a University Pathway Programme

MA in Language Testing (by Distance)

Nikolay Gochev

December 2013

Word Count: 16,322

2

Abstract:

The study investigates the predictive validity of three second language performance

assessments in a foundation programme provided by a private college in partnership

with a UK university. Adopting McNamara‟s (1996) distinction between weak and

strong second language performance assessments, the study examines the relationship

between the Final Academic Score awarded at the end of the foundation programme

and language proficiency as measured by (a) two weak performance assessments:

IELTS and Language for Study 1 (an in-house language test), and (b) a strong

performance assessment: Skills for Study 1 (an end-of-course assessment). Spearman

rank order correlation coefficients ( ) were calculated for the three sets of scores.

Significant positive correlations were found between the Final Academic Score and

both the strong and weak second language performance assessment scores and sub-

scores. However, the findings suggest that the strong performance assessment is a

better predictor of students‟ academic achievement at the College than either of the

weak performance assessments. Possible reasons for the findings are discussed in the

light of the theoretical insights from the literature on second language performance

assessment. The paper argues that the main factors which contribute to the better

predictive power of the strong performance assessment are the task/context approach

to construct definition, the use of indigenous marking criteria, and the untimed,

process-oriented nature of the assessment. Limitations and implications for future

research are also discussed.

3

Contents

1. Introduction…………………………………………………………………….5

2. Background…………………………………………………………………….7

2.1 Literature Review …………………………………………………….........7

2.1.1 Performance assessment……………………………………………7

2.1.2 Strong and weak second language performance assessments……...8

2.1.3 The construct in EAP……………………………………………..11

2.1.4 Untimed out-of-class and timed in-class assessments………….....14

2.1.5 Product-oriented and process-oriented assessments.......................16

2.1.6 Criterion-related validity……………………………………….....17

2.1.7 Predictive validity of large-scale EAP tests…………………........19

2.1.8 Predictive validity of small-scale EAP assessments………….......21

2.1.9 Summary of the literature review ………………………………...22

2.2 Research context……………………………………………………….....23

2.3 Research questions………………………………………………………..26

3. Methodology………………………………………………………………….27

3.1 Participants………………………………………………………………..27

3.2 Instruments……………………………………………………………......27

3.2.1 International English Language Testing System (IETLS)……......27

3.2.2 Language for Study 1 (LS1)………………………………………27

3.2.3 Skills for Study 1 (SS1)…………………………………………...28

3.2.4 Final Academic Score…………………………………………….29

3.2.5 Final Academic Score minus SS1 ………………………………...30

3.3 Data Collection and Procedures…………………………………………..30

3.3.1 Data collection…………………………………………………….30

3.3.2 Procedures ………………………………………………………...32

4. Results ………………………………………………………………………...34

4.1 IELTS & Final Academic Score ………………………………………...34

4.1.1 Box plots – outliers and extreme………………………………….34

4.1.2 Scatterplots………………………………………………………..34

4.1.3 Descriptive statistics………………………………………………35

4.1.4 Q-Q plots and test of normality…………………………………...36

4.1.5 Non-parametric correlations………………………………………36

4.2 LS1 & Final Academic Score……………………………………………..38

4.2.1 Box plots – outliers and extreme cases…………………………...38

4.2.2 Scatterplots………………………………………………………..38


4.2.4 Q-Q plots………………………………………………………….40


4.3 SS1 and Final minus SS1…………………………………………………41

4.3.1 Box plots – outliers and extreme cases…………………………...41

4.3.2 Scatterplots………………………………………………………..42


4

4.3.4 Q-Q plots and test of normality…………………………………...43


5. Discussion of results…………………………………………………………..45

5.1 Research question 1……………………………………………………….45

5.2 Research question 2……………………………………………………….49

5.3 Research question 3…………………………………………………….....52

6. Conclusions and implications for future research…………………………….55

7. References…………………………………………………………………….59

8. Appendices……………………………………………………………………66

Appendix I: Sample LS1 Reading & Writing Exam………………….......66

Appendix II: Sample LS1 Listening & Speaking Exam…………………...69

Appendix III: Sample SS1 Essay…………………………………………...72

Appendix IV: Sample SS1 Presentation Task………………………………74

Appendix V: Box-and-whisker plots of frequency distribution………........75

output IELTS Overall Score and Final Academic

Score

Appendix VI: Histograms of IELTS and Final Academic Score…………...76

Distributions

Appendix VII: Normal and Detrended Q-Q plots of IELTS Overall………..77

and Final Academic Score

Appendix VIII: Box-and-whisker plots of frequency distribution output:…...79

LS1 Reported Overall Score and Final Academic Score

Appendix IX: Normal and Detrended Q-Q plots of LS1 Reported,………...80

LS1 R&W, LS1 S&L, and Final Academic Score

Appendix X: Box-and-whisker plots of frequency distribution………........84

output: SS1 Overall Score and Final minus SS1

(F-SS1)

Appendix XI: Normal and Detrended Q-Q plots of SS1 and……………….85

Final minus SS1

5

1. INTRODUCTION

The UK is a popular study destination for international students. In 2011-2012,

435,230 international students were enrolled in UK universities (UK Council for

International Student Affairs [UKCISA], 2013). Several routes into UK higher

education are available to overseas students. Some students qualify for direct entry on

a par with British students, for example EU nationals whose qualifications are

equivalent to A levels, or students who hold the International Baccalaureate (IB),

European Baccalaureate (EB) and Irish Leaving Certificate. Others can gain access to

higher education via a range of pathway programmes such as Foundation Years or

Foundation courses (“Routes into university and higher education”, 2013). In addition

to having appropriate academic qualifications, international students must satisfy

English language entry requirements. The UK Border Agency [UKBA] (2013)

specifies a minimum level of English language proficiency for university study of

CEFR B2 and provides a list of approved English language tests. Furthermore, UK

universities set their own English language entry requirements, which range from

IELTS 5.5 to 7, or equivalent, for undergraduates (Hayatt & Brooks, 2009, p.14).

Students who do not meet either the academic or language requirements have the

option of enrolling in Foundation Programmes provided either by universities or

private colleges, which typically include academic English, study skills, subject-

specific and IT courses (Reinders, Moore, & Lewis, 2008, p. 11) aimed at improving

students‟ academic qualifications and English language ability.

Undoubtedly, language assessments play a significant part in such programmes.

Although a number of language proficiency tests such as IELTS and TOEFL are used

for screening purposes (Howell et al, 2012), some authors have argued that their use as

a measure of achievement in English for Academic Purposes (EAP) courses may be

invalid (Alderson, 2000; Banerjee & Wall, 2006). Schmitt (2012) points out that the

aim of EAP courses in pre-sessional and foundation programmes is broader than

simply language teaching. Such courses are specifically tailored to the requirements of

the academic target language use situation and therefore the assessments need to focus

on measuring achievement rather than proficiency. Thus it could be argued that to

justify the use of a particular assessment instrument, data must be gathered to provide

evidence for the validity of the inferences drawn from that assessment. Both

proficiency and achievement EAP tests claim to assess students‟ readiness for

6

university study. Therefore, one possible source of evidence is correlational research

into the predictive validity of language assessments.

This dissertation begins with a Background section, which presents an overview of the

literature on second language performance assessment. It outlines the distinction

between strong and weak performance assessments based on their approach to

construct definition and the criteria used to evaluate test-takers‟ performance. The

construct in English for Academic Purposes testing is discussed, followed by two

important aspects that impact on second language performance assessment, namely

time and process-orientedness. After that, the importance of criterion-related validity

within Messick‟s (1980) concept of validity as a unitary concept is discussed, and the

results of previous predictive validity studies evaluated. Finally, a rationale for the

current study is presented.

The second part of the Background section describes the research context and

formulates three research questions which the study aims to answer. Section 3 focuses

on the methodology and provides information about the participants of the study, the

assessment instruments used, the data collection methods and data analysis

procedures. Section 4 introduces the rationale for selecting the appropriate correlation

coefficients for this study and reports the results of the correlational analysis. Section 5

attempts to answer the three research questions by analysing the extent to which the

results from the study support the theory behind weak and strong performance

assessments outlined in the literature review. The dissertation ends with a conclusion

and suggestions for future research.

7

2. BACKGROUND

2.1 Literature review

2.1.1 Performance assessment

A performance test is „a test in which the ability of candidates to perform particular

tasks, usually associated with job or study requirements, is assessed‟ (Davies et al,

1999, p.144). The need for performance assessment arose in the 1960s for two main

reasons. The first one, as the definition above implies, was to evaluate the language

proficiency of overseas professionals, such as doctors and teachers, who sought

employment in English-speaking countries, and of foreign students who wished to

study in English-medium universities. The second reason was that language testing

had to keep up with the developments in the classroom, which began to emphasise the

teaching of communicative language ability (McNamara, 1996, p. 25). As

Wigglesworth (2008) points out, compared with the traditional discrete item tests,

performance based assessments aim to evaluate examinees on a wide range of

language abilities. However, this assessment procedure introduced two factors missing

from the traditional tests, namely performance on the part of the test-taker and a

judging process used to evaluate that performance (McNamara, 1996, p. 10). Figure 1

(adapted from McNamara, 1996, p.9 and Skehan, 1998, p.170) sums up the

characteristics of performance assessment.

RATER

↓

SCALE RATING (SCORE)

↓

PERFORMANCE

↑

INSTRUMENT

↑

CANDIDATE

↑

UNDERLYING COMPETENCES

Figure 1. The characteristics of performance assessment

8

This model illustrates that a number of factors and the interactions between them may

affect the rating of a candidate and consequently the decision taken on the basis of the

test (Skehan, 1998, p.169). As the aim of assessment is to provide „an adequate and

accurate basis for making trustworthy interpretations‟ (Norris, 2002), I will next

consider the role of the assessment instruments, the rating criteria and the context in

which they are used in achieving that objective.

2.1.2 Strong and weak second language performance assessments

Performance assessments are often described as direct, authentic and realistic,

designed to elicit and test real-world skills (McNamara, 1996; Bachman, 2007; Brown,

2004). However, McNamara (2006, p.43) argues that such features are often a matter

of degree and that actual tests will most likely be positioned somewhere along the

continuum from „strong‟ to „weak‟ second language performance assessments.

In the „strong‟ sense, real-world tasks and criteria are used and the target of

assessment is the execution of the task itself. Any assessment of language ability is

incidental in as much as it contributes to the successful completion of the task

(McNamara, 1996, p. 43). As McNamara (1996, p.43) puts it, „adequate second

language proficiency is a necessary but not a sufficient condition for success on the

performance task‟. Brown, Hudson, Norris and Bonk (2002) use the term „task-based

language assessment‟ to describe strong performance assessments, defining the

construct of interest as the ability to „accomplish particular tasks or task types‟. In his

discussion of construct definition in language testing, Bachman (2007, p.42) also

characterises the construct of strong performance tests as task/context focused.

Moreover, Skehan (1998, p.156) observes that such tests represent „direct methods of

gathering data‟ with little emphasis on underlying abilities, but with higher success

rate in predicting performance in the real world. This success, according to Skehan, is

due to a large extent to the fact that the „real life approach‟ restricts itself to making

predictions in specific, clearly defined situations.

Task-based performance is assessed by employing rating scales. Douglas (2000, p.68)

refers to the criteria used by specialists in the real world as „indigenous assessment

criteria‟ and argues that such criteria should be used as the basis for constructing the

9

rating scales of specific purpose language tests. Moreover, since in the „strong‟ sense

successful completion of the task is the criterion according to which performance is

evaluated, non-linguistic factors are seen as contributing to rather than diminishing the

authenticity of the test. Thus, it is conceivable that lower general language proficiency

may not result in failure for a candidate who has other relevant skills, while a highly

proficient candidate may actually fail a strong performance test if they lack those

relevant skills (Jones, 1985, cited in McNamara, 1996, p.39). Furthermore, Brindley

(2009) points out that outside applied linguistics successful communication is often

evaluated by applying performance criteria such as „empathy, behavioural flexibility

and interaction management‟ and that language testers may need to consider such

criteria in performance assessment. Overall, strong performance assessments place

less emphasis on language skills and focus more on task completion whose success is

judged on the basis of indigenous criteria derived from the target language use

situation.

In the „weak‟ sense, the tasks of second language performance tests only resemble or

simulate real-life tasks, but the actual focus is on „language performance‟ (McNamara,

1996, p. 44). In other words, the tasks are the means through which an adequate

sample of language is elicited for the purposes of determining the candidates‟ level of

proficiency. This chimes in with Long and Norris‟s (2009, p.138) observations that in

some tests tasks are used simply „as a means for eliciting particular components of the

language system which are then measured and evaluated‟. Bachman (2007, p.42)

defines the construct of weak performance tests as „trait/ability focused‟, i.e. as „an

ability or capacity that resides in the individual‟ and therefore performance on the test

allows for inferences to be made about a test-taker‟s „underlying ability for language

use‟. McNamara (1996, p.44) points out that while weak performance tests may

appear to mirror strong performance tests, such resemblance is largely superficial.

They are mainly used because of their assumed face validity, positive washback and as

a way of offsetting two major disadvantages of strong performance tests, namely time

and expense. McNamara (1996, p.44) notes that most performance based general

proficiency tests are of the weak type and even EAP tests such as IELTS are not strong

performance tests. This view is supported by Brown (2004) who states that

composition writing and oral interviews which are evaluated on the basis of linguistic

criteria like accuracy, range and fluency are weak performance tests. A possible

10

consequence is that although the tasks that comprise weak performance tests like

IELTS may resemble the tasks that students undertake at university, some students,

who fail these simulated tasks because of low proficiency in English, are likely to

adopt a range of strategies in the real world that will allow them to complete academic

assignments successfully (McNamara, 1996, p. 42).

On the other hand, according to Bachman (2007, p.56), what distinguishes these two

approaches is not so much the types of assessment tasks used but the inferences which

are drawn from the test scores. The „trait/ability‟ approach is used to support

inferences about the abilities of the examinees, while the „task/context‟ approach is

mainly concerned with „predictions about future performance on real-world tasks‟.

Figure 2 illustrates these two different views of language proficiency together with the

possible interpretations of test results.

Theoretical

view of

language ability

Language proficiency as

pragmatic ascription

(„Able to do X‟)

Language proficiency as

theoretical construct

(„Has ability X‟)

↓ ↓

View of test

Test as a pragmatic

prediction device

Test as an indicator

of ability

↓ ↓

Test use

Use for prediction of

future behaviour

Interpretation as

level of ability

↓ ↓

Evidential

support

Predictive utility

Construct validity

Figure 2. Contrasting views of language proficiency, uses/interpretations of test

results, and validity (Bachman, 1990, p.254)

According to Bachman (1990, pp.253-254) evidence of „predictability does not

constitute evidence for making inferences about ability.‟ In other words, the inference

11

„A is able to do X‟ therefore „A has ability Y‟ is not valid. What is not clear, however,

is whether the inference „A has ability Y‟ therefore „A is able to do X‟ is a reasonable

one to make. But this is exactly what weak performance assessments seem to be used

for. Alderson, Clapham, and Wall (1995, pp.180-181) point out that proficiency tests

like IELTS and TOEFL are used to make predictions about the future performance of

students in English-medium universities. A strong argument against such an inference

is the fact that native speakers of English do not necessarily perform better at

university than international students, who are presumably less proficient (Dooey &

Oliver, 2002; Cho & Bridgeman, 2012). This raises questions about the fairness of the

gate-keeping function of weak performance assessments like IELTS, namely the

extent to which cut-off scores on weak performance tests can be justified without

recourse to scores from strong performance tests. If these two approaches support

different inferences about the test-takers, it is arguable that no important decision

should be taken on the basis of only one of them. Such a view is expounded by a

number of researchers who emphasise that inferences about test-takers should be

drawn from a range of sources to ensure fairness and accountability in language

testing (Dooey & Oliver, 2002; O‟Loughlin, 2011; Rees, 1999; Shohamy, 2001).

2.1.3 The construct in EAP

Evidently, the way the construct of interest is defined affects the inferences that can be

drawn from test results. As Skehan (1998, p.153) observes the development of weak

and strong performance assessments came from a conflict between three underlying

issues in language testing: inferring ability, predicting performance, and generalising

from context to context. Performance assessment in EAP contexts inevitably has to

address these issues. Arguably, the most important question is: what construct should

EAP tests measure? One common approach to defining the construct is analysis of the

target language use situation (McNamara, 1997; Douglas, 2000), which invariably

raises the issue of academic skills. Jordan (1997, p.7) provides a comprehensive list of

skills that are essential for both native and non-native speaker students. In addition to

the four skills of listening, speaking, reading and writing for academic purposes,

emphasis is also placed on research and reference skills, i.e. effective library use,

finding and analysing evidence and data, summarising, paraphrasing and using

quotations and in-text citations. Although these skills are considered transferrable if

12

the students already possess them, such an assumption cannot be made automatically

as students arrive in English-speaking countries with a range of previous educational

experiences (Jordan, 1997, p.5). This means that the objectives of foundation courses

for international students typically include the development of both language

proficiency and adequate study skills. In terms of assessment, then, achievement tests

whose marking criteria reflect the real-life skills necessary for the academic

environment in the UK are more likely to provide an accurate measure of readiness to

study in an English-medium university than a more general academic English

proficiency test. As Banerjee and Wall (2006) point out using external EAP tests such

as IELTS „would risk construct under-representativeness‟ and therefore reduce the

usefulness of the scores for the stakeholders, namely the various university

departments and the students themselves.

Indeed, some authors (Green, 2007; Moore & Morton, 2007) have argued that the

IELTS writing construct does not reflect academic writing at university. For example,

Moore and Morton (2007) identify three main areas of discrepancy between IELTS

writing task 2 and writing tasks in higher education. First, in IELTS essay tasks the

test-takers are expected to rely on their personal experience and background

knowledge. However, in higher education settings students are expected to make use

of published sources, which they need to be able to paraphrase, summarise and cite in

order to support the arguments put forward in their course assignments. Second,

IELTS tasks do not include a variety of text types and genres, nor do they include the

same range of rhetorical functions as academic tasks. Third, IELTS tasks are primarily

phenomenal in nature as they focus on „real-world […] situations, actions, practices‟,

while university tasks can be phenomenal and metaphenomenal, i.e. emphasising

„abstract […] theories, ideas, methods‟ (Moore & Morton, 2007).

In an attempt to address such construct under-representativeness in large-scale EAP

examinations, many institutions have introduced integrated reading/writing,

listening/speaking, and reading/listening/writing tasks. Lewkowicz (1997) summarises

the advantages of integrated tasks: they are direct, add realism to assessments, and are

appropriate for homogeneous populations whose language use needs are clearly

identifiable as in university settings.

13

Reading-to-write tasks are described as authentic because in academic contexts writing

is always preceded by reading a wide variety of sources (Hamp-Lyons & Kroll, 1997;

Weigle, 2004; Plakans, 2008; Gebril & Plakans, 2013). Furthermore, an impromptu

unseen writing task is inauthentic because even if university students have to be able

to write mini-essays in response to exam questions, they will have prepared and

revised for that exam by reading the relevant coursebook and other assigned texts. A

second argument for the use of reading-to-write tasks is that they may reduce the

background knowledge effect as all candidates will have access to the same

information on which to base their arguments (Weigle, 2004). Studies of reading-to-

write tasks have shown that lower level students make less use of the source texts,

perhaps as a result of weaker reading comprehension skills (Gebril & Plakans, 2013);

that better inferences can be made about academic writing ability in terms of planning

and discourse synthesis than on the basis of writing-only tasks (Plakans, 2008); and

that preparing for a reading-to-write test has a positive washback as it enhances the

skills that students require for academic study (Weigle, 2004). Moreover, Weigle

(2002, p.178) argues that the construct of writing in academic contexts should include

the ability „to make use of all available resources‟ to communicate effectively with the

intended audience rather than the ability to create texts by simply relying on one‟s own

linguistic competence. She emphasises that although this „complicates the definition

and measurement of the construct enormously‟, it is a better reflection of what writers

do in the real world.

Integrated Listening-and-Speaking or Listening-and-Writing tasks require test-takers

to listen to a recording and then incorporate information from the listening text into

speech, for example by giving an oral summary of the content of the talk, or into an

essay, for example as support for their arguments. Such tasks are viewed as integral

part of academic study where students are frequently required to make use of multiple

sources, including lectures and podcasts (Taylor & Geranpayeh, 2011; Slaght &

Howell, 2007). As with reading-to-write tasks, this complicates the construct

definition as the task involves more than one skill. However, despite the drawback of

„muddied measurement‟ (Weir, 1990, cited in Lewkowicz, 1997, p.127), several

authors emphasise the increased authenticity of integrated tasks (Douglas, 2010;

Slaght & Howell, 2007). Indeed, based on analysis of the target language use domain

in higher education, several language testing bodies have introduced integrated tasks:

14

The Test of English for Education Purposes (TEEP) contains an integrated reading-

listening-writing task, where candidates use the reading and listening parts of the test

as sources of ideas for an academic essay (International Study and Language Institute,

2013); the Canadian Academic English Language (CAEL) Assessment, in which

students base their essays on three reading passages and one lecture on the same topic

(CAEL, 2013); and the speaking and writing sections of TOEFL have been designed

to integrate information from the reading and listening tasks (The TOEFL iBT Test,

2013).

In addition, since the assessment criteria which comprise rating scales „act as the de

facto test construct‟ (Knoch, 2011), in the EAP context they should simulate the

criteria applied by subject specialists in university settings. There is some evidence

that for subject tutors content takes priority over language. For example, a study by

Errey (2000) used introspective verbal reports to investigate the extent to which five

university lecturers from five different departments were influenced by language

errors in student essays. The analysis of the verbal reports showed that content and use

of sources were ranked as the most significant factors that lecturers took into account

when assessing students‟ work, followed by coherence and organisation. Out of 24

factors, grammatical accuracy and accurate vocabulary were ranked 10th

and 18th

respectively. However, the study also revealed inconsistencies in the ways lecturers

actually dealt with language errors, with some lecturers ignoring or tolerating them,

while others deducted marks for linguistic inaccuracies. This conclusion seems to

support an observation by Hartill (2000) that the linguistic quality of students‟ essays

is „left very much to the individual tutor‟s discretion‟ and that in many university

departments there is a tendency for less strict emphasis on grammatical accuracy.

2.1.4 Untimed out-of-class and timed in-class assessments

Another important aspect of performance assessments is the time allotted and the place

where the assessment is carried out. Weigle (2002, p.173) distinguishes between in-

class timed and out-of-class untimed writing assessments. The purpose of in-class

timed writing is to test whether students can plan and write an essay on their own

without recourse to external sources such as textbooks, dictionaries, or advice from

peers and tutors. On the other hand, out-of-class untimed writing in academic settings

aims to provide direct experience of the stages of the essay writing process such as

15

analysing the essay question, finding and evaluating sources, organising notes,

drafting, receiving feedback, redrafting and editing (Weigle, 2002, p.174). Thus the

completion of one extended writing assignment may take several weeks as opposed to

the 40 to 60 minutes normally allocated in in-class timed tests. Two studies in the

1990s compared student performance on in-class timed and out-of-class untimed

writing tasks. Caudery (1990) investigated the validity of timed essays, which, he felt,

did not measure the ability to write extended texts required in academic or

occupational settings. The study compares the performance of 24 secondary school

students preparing for GCE O-level examination in English Language in Cyprus. The

students produced two pieces of writing – a timed in-class essay written within a 40-

minute time limit, and an out-of-class untimed essay, which they started in class and

finished at home over a 2-day period. The study found no evidence that time affected

the overall scores of the students. Among the explanations given, the author mentions

the negative washback of preparing for timed writing and the age of the students, who,

in his view, lacked maturity to take advantage of the increased amount of time. He

speculates that a study involving adult students may yield different results. Kroll

(1990) investigated the performance of first-year undergraduate foreign students at an

American university on four writing tasks. Two of the tasks were timed in-class essays

written within a 60-minute time limit and the other two were take-home essay written

over a period of 10-14 days. This study also failed to find support for the hypothesis

that more time will result in statistically significant differences between the in-class

and out-of-class tasks, despite some improvement on the syntactic and rhetorical level.

The explanations offered echo the ones by Caudrey (1990), namely students may be

unaware of what „constitutes good writing‟ and therefore were unable to make good

use of the extra time. Moreover, Kroll (1990) points out that they had not been taught

about the process of writing and concludes that explicit instruction is required in order

to guide students in their development as competent academic writers. Therefore,

students are likely to benefit from out-of-class untimed writing only if they have

awareness of the academic genres relevant to their discipline and the most effective

methods to approach academic writing tasks.

16

2.1.5 Product-oriented and process-oriented assessments

It is clear from the preceding discussion that the way writing is taught and assessed in

EAP courses has direct relevance to student attainment at university. Traditionally, a

distinction has been made between the product and process approach to teaching

academic writing. Jordan (1997, p.164) describes the product approach as „concerned

with the finished product – the text‟. The central concept of this method is the

presentation of a model text whose characteristics are analysed with the aim of

producing a parallel text. The process of analysis also includes a range of language

functions such as explanation, definition, exemplification to name but a few. The

product approach has been used to familiarise students with the wide range of

academic genres they will encounter at university: „essays, reports, case studies,

projects, literature reviews, exam answers, research papers/articles, dissertations and

theses‟ (Jordan, 1997, pp. 165-166).

On the other hand, the process approach focuses on the composing process which

writers go through. It views writing as problem solving, which begins with a thinking

stage that involves generating, grouping together and selecting relevant ideas. This is

followed by the process stage that involves a cycle of writing a draft, receiving peer or

tutor feedback, revising the draft and so on until satisfactory completion of the task

(Dudley-Evans & St John, 1998, p. 117).

In terms of assessment, the majority of large scale tests such as IELTS, TOEFL and

the Cambridge ESOL exams are product-oriented. Candidates are advised to plan and

check their essays but the short time limit and lack of feedback from peers or tutors

preclude revision, which means that such essays can only be considered first drafts

(Weigle, 2002, p. 176). There have been some attempts to introduce process-oriented

writing assessments. Weigle (2002, pp. 191-192) cites an example of a test developed

at Georgia State University, where students receive a grade on how well they have

edited and revised their final draft in the light of peer and tutor feedback. The grade

was added to the assessment criteria in order to encourage students to adopt the

process approach to writing, which according to Hartill (2000) is particularly

appropriate in higher education settings.

17

Another study by Cho (2003) examined student performance on two placement tests

based on either a product-oriented or a process-oriented approach to assessing writing.

The first is a timed integrated listening-reading-writing task, where the examinees

listen to a lecture, read an article and write an essay within a 70-minute time frame.

The second is a two-day test, which the author describes as a „work-shop based essay‟,

whose aim is to simulate as closely as possible the real-life conditions in which

academic writing is produced. On day one, the examinees listen to a lecture, read an

article, engage in discussions and produce a first draft. On day two, the students read

the drafts of two of their peers and give them feedback. After that they have 60

minutes to write their final draft taking into account the comments they received. Only

the final draft is marked. The results show that, overall, the workshop essays received

higher scores and 75% of the test-takers were placed in a higher level group. Also,

comparison of the textual quality of the two essays revealed that the workshop essays

ranked higher on all features excluding „source attribution‟. Organisation, content and

source use showed the most improvement, while linguistic features such as grammar,

vocabulary and sentence variety were least affected by the test method. These two

examples demonstrate that it is feasible to design both timed and untimed assessments

incorporating the process approach to writing, which is highly relevant in academic

contexts.

2.1.6 Criterion-related validity

The concept of validity in language testing has undergone a marked change since the

1960s from being viewed as a property of the test to the current understanding of

validity as a unitary concept (Chapelle, 1999). Messick (1980) defines validity as „an

overall evaluative judgment of the adequacy and appropriateness of inferences drawn

from test scores‟. Moreover, he argues that construct validity is „a unifying concept of

validity‟, which can be described as having two facets: (a) „the source of justification‟

of the testing and (b) „the function or outcome of the testing‟. He also proposes two

sources of justification – evidential and consequential, and two outcomes –

interpretation and use (Messick, 1980). Chapelle (1999) explains that in terms of

outcomes, tests allow us, on the one hand, to make inferences about the underlying

abilities of the candidates (interpretation), and, on the other hand, to make decisions

about the candidates on the basis of their test results (use). Furthermore, the labels

„evidential‟ and „consequential‟ sources of justification pertain to the arguments

18

required to justify the testing outcomes (Bachman,1990, p.243) Messick (1995)

proposes six aspects of construct validity, namely content, substantive, structural,

generalizability, external, and consequential, and specifies the sources of evidence for

each of them. What is of particular interest for this study is the external aspect of

construct validity which aims to establish the relationship with other measures so that

„the constructs represented in the assessment should rationally account for the external

pattern of correlations‟ (Messick, 1995). Thus evidence of criterion-related validity

will support an argument for the construct validity of a test.

Criterion-related validity can be defined as „the extent to which a person‟s

performance on a criterion measure can be estimated from that person‟s performance

on the assessment procedure being validated‟ (Salvia & Ysseldyke, 2001, p. 152). The

criterion chosen can be one of a number of external variables such as a syllabus,

teachers‟ judgements, performance in the real world, or another test (Davies et al,

1999). The criterion-related validity of an assessment procedure typically includes

concurrent and predictive validities, which are determined statistically by using

correlation.

The aim of concurrent validation is to investigate the relationship between an

assessment procedure, e.g. a test, and another measure for the same group of test-

takers obtained at approximately the same time (Weir, 2005, p. 207). In language

testing, a well-established standardised test such as IELTS or TOEFL can be used as a

criterion. However, Davies et al (1999, p. 30) caution that the test being validated may

define the construct of ability differently. Therefore, such a test may correlate poorly

with the standardised test chosen as criterion.

Predictive validity refers to how accurately a test score predicts performance on the

criterion measure at some future point in time (Weir, 2005, p. 209). For example, the

scores on an EAP test, whose aim is to assess students‟ readiness for university study,

should correlate highly with the academic performance of the same students as

measured by their grades on subsequent academic courses (Davies et al, 1999, p.149).

A number of predictive validity studies both of standardised and locally developed

tests have been conducted to investigate the relationship between language proficiency

19

and academic achievement. Predictably, well-established tests such as IELTS and

TOEFL, which undergo a rigorous test production process, have commissioned both

internal and external validation research in an attempt to support the claims that they

provide an adequate measure of candidates‟ readiness to „begin studying or training in

the medium of English‟ (IELTS, 2012) and their „ability to use and understand English

at the university level‟ (The TOEFL iBT Test, 2013).

2.1.7 Predictive validity of large-scale EAP tests

Predictive validity studies of language proficiency as measured by IELTS have

produced inconclusive results. In most studies, the criterion measure of choice is the

Grade Point Average (GPA) though some use first term or first year GPA to minimise

the effect of the moderator variable „elapsed time‟ (Cope, 2011). Some studies report

significant, positive, but weak relationship between IELTS scores and GPA (Feast,

2002), or moderate correlation between overall IELTS scores and first semester GPA

(Woodrow, 2006; Yen & Kuzma, 2009). However, some studies found either no

positive correlations between IELTS scores and GPA (Cotton & Conrow, 1998), or

little evidence for „the validity of IELTS as a predictor of academic success‟ (Dooey &

Oliver, 2002). On the other hand, Ingram and Bayliss (2007) investigated the

relationship between IELTS scores and three different dependent variables: student

perceptions of their own language proficiency, researcher ratings and staff

observations of the students‟ language behaviour in their academic courses. They

found that IELTS scores accurately predicted the students‟ language behaviour in the

four skills during the first six month of university study.

The extent to which the IELTS sub-tests correlate with the selected criterion has also

been explored by predictive validity studies. Cotton and Conrow (1998) found that the

reading and writing scores correlated positively with staff ratings, while in Kerstjens

and Nery‟s (2000) study only the reading score showed a significant positive

correlation with the first-semester GPA. On the other hand, Yen and Kuzma (2009)

provide evidence for significant correlations between the listening, reading and writing

scores and the first-semester GPA among a group of Chinese students. However,

Woodrow (2006) found positive correlations between all scores and first-semester

GPA with the exception of reading. What is more, the level of English language

20

proficiency appeared to influence the achievement of students scoring 6.5 or lower,

whereas with students scoring 7 and above, English proficiency has no effect on

academic performance (Woodrow, 2006).

TOEFL has also been subject to predictive validity studies. Cho and Bridgeman

(2012) found small predictive validity correlation coefficients between TOEFL iBT

and GPA with average weighted correlations r=.16 for graduates and r=.18 for

undergraduates across four broad disciplines – business, humanities and arts, science

and engineering and social sciences. However, they also showed that when students‟

TOEFL iBT scores were in the top 25%, they were twice as likely to be in the top 25%

GPA group as students in the bottom 25% TOEFL group.

Although IELTS and TOEFL seem to be the most widely recognised English language

examinations for academic study, universities in English-speaking countries recognise

a range of English language qualifications. A study by Oliver, Vanderford, and Grote

(2012) investigated the predictive validity of 25 types of proficiency evidence

accepted by one Australian university. The authors conclude that although English

language proficiency tests have low predictive validity overall, standardised tests such

as IELTS and TOEFL are the best indicators of future academic achievement. Still, the

Pearson correlations between the IELTS scores and the Weighted Average Marks

(WAM) were low. For undergraduates, the only significant correlation was between

the IELTS reading sub-test and WAM (r=.27), while for postgraduates, there was a

weak but significant relationship between WAM and the overall score and three sub-

test scores with the exception of writing (the correlations ranged from r=.16 for

speaking to r=.29 for reading).

The inconclusive results may be due to several factors. First, many of the studies are

based on small sample sizes – 101 (Feast, 2002), 33 (Cotton, F., & Conrow, 1995), 65

(Dooey & Oliver, 2002), 113 (Kerstjens & Nery, 2000), 61 (Yen & Kuzma, 2009).

Second, the criterion in each study is different. Third, participants included both

undergraduate and postgraduate students from a range of disciplines. All of these

make it difficult to generalise about the effect of proficiency on academic success.

21

2.1.8 Predictive validity of small-scale EAP assessments

The guidelines adopted by the European Association for Language Testing and

Assessment recommend that language testers should provide evidence for the validity

of the assessment instruments they develop (EALTA, 2006). Such evidence

necessarily includes predictive validity studies and some researchers have examined

the predictive power of locally-developed EAP performance tests. Cope (2011)

focused on task-based assessments closely linked to the students‟ future academic

programs at an Australian university. Black (1991) used the scores attained in two

academic skills courses as the predictor variable: EASL 140, which focused on

speaking skills, includes presentations, vocabulary, listening comprehension, and

pronunciation, and EASL 143, which aimed to develop writing skills and includes

library research techniques, writing term papers, and problematic grammatical

structures. Lee (2005) analysed an EAP placement test which is described as „a

performance assessment that requires students to summarise and integrate content

from articles and lectures‟. The test also promotes a process approach to writing and

involves pre-writing, receiving feedback, re-drafting and using a word processor.

These predictive validity studies produced a range of correlation coefficients. Cope

(2011) reports statistically significant correlations between the task-based scores for

two „direct entry‟ programmes and GPAs of r=.361 and r=.418, and one correlation

that did not achieve significance of r=.156. He interprets the statistics as showing

weak to substantial relationships, where correlations of r=0.40–0.70 are considered

substantial. Black (1991) found small to modest correlations between the skills scores

and the students‟ GPA. The author notes that even the strongest correlation (r=.392,

p<.05) for one of the groups only explains 15% of the variance, which means that the

rest, or 85% may be the result of other factors, e.g. adaptability, motivation,

organisational skills. In Lee‟s (2005) study, the correlation coefficients between the

placement performance test and first semester GPA varied. It was positive for

Business (r=.275) and Humanities (r=.350), but negative for Life Sciences (r= –.548)

and Technology (r= –.213).

Robison & Ross (1996) examined the relationship between English language

proficiency and performance on academic research tasks. They studied the extent to

which an English language placement test and an indirect Library Skills Research Test

22

predicted performance on a direct Library Skills Research Test, which represents an

authentic task for university students. The results showed that the language test alone

was a weak predictor of „the actual research skills of the EAP students‟. However,

performance on the direct Library Skills Research Test could be better predicted by a

combination of a language proficiency test and an indirect Library Skills Research

Test, which simulates the task of academic research.

On the other hand, Cho‟s (2003) investigation of the predictive validity of the „work-

shop based essay‟, which aimed to simulate as closely as possible the real-life aspects

of academic writing, showed that its predictive power was not better than that of the

timed essay when correlated with faculty evaluation of the students. As Cho (2003)

notes, relying on the workshop test for placement purposes leads to more false

positives, i.e. students are considered to have ability when they do not, whereas using

the timed essay scores results in more false negatives, i.e. students are considered not

to have ability when they do. As neither of these results is satisfactory, the author

cautions against drawing inferences about test-takers‟ abilities based on a single piece

of writing. As with the predictive validity studies of large-scale EAP tests, it is

difficult to generalise from such a variety of assessment instruments, each defining

EAP proficiency differently, using different criterion measures, and including different

sample sizes and participants from various academic fields.

Moreover, other factors besides language proficiency are likely to contribute to

academic attainment. Personal background, academic background, current teaching

and support (Feast, 2002), motivation, learning strategies, quantitative skills (Cho &

Bridgeman, 2012), intellect and acculturation (Cope, 2011) may affect achievement at

university.

2.1.9 Summary of the literature review

In summary, strong performance assessments tend to operationalize a task/context

focused construct, employ indigenous marking criteria, which might include non-

linguistic factors, are often untimed in the sense that the assessment may be carried out

over a longer time period – from a few hours to several weeks, and typically

emphasise a process approach to task completion. On the other hand, weak

performance assessments tend to operationalize a trait/ability focused construct based

23

on theories of language proficiency, and consequently make use of assessment criteria

that are primarily linguistic, are timed in the sense that test-takers usually have 30-45

minutes to complete one writing task, and are product-oriented since there is no

opportunity for the candidates to implement the process approach to writing except in

a relatively superficial form and it is the final product, e.g. an essay or a report, that is

assessed.

The aim of the present study is to compare the predictive validity of three second

language performance assessments – two weak and one strong – in the context of an

international college which offers foundation programmes to overseas students. One of

the weak performance assessments is IELTS, which is the examination of choice for

the majority of students attending the college, and the second one is an in-house

language test. The strong performance assessment is also locally developed and

consists of two end-of-course assignments. Detailed information about the assessments

will be provided in the Methodology section.

There are four main reasons that make this study worthwhile. First, testing

organisations need to provide evidence for the construct and consequential validity of

the assessments they produce. Predictive validity studies contribute to the formulation

of a convincing argument for the construct validity of assessments. Second, the results

of the study may lead to improvements in the language and study skills courses at the

college. Third, the study will address some of the limitations of previous predictive

validity studies such as small sample sizes, range restriction, and heterogeneity of

population. Fourth, the college presents an ideal opportunity to compare the predictive

validity of strong and weak performance assessments for the same group of students.

This is important as empirical evidence is necessary to ensure fairness of test use in

educational settings where decisions based on test scores can have far-reaching

implications for the individual test-taker.

2.2 Research context

The College is a private pathway provider in partnership with a UK university. It is

part of a wider network of colleges, which offer preparation courses for international

students who wish to pursue a bachelor‟s or a master‟s degree in the UK. At the

24

College, students can enrol on Foundation Certificate and Graduate Diploma courses

either in Business, Law and Social Sciences, or Science and Engineering. The present

study will focus the Foundation Certificate in Business, Law and Social Sciences.

From September 2010 to September 2013 the Foundation Certificate in Business, Law

and Social Sciences offered two pathways which consisted of eight modules (see

Figure 3).

Programme Structure and Pathways

Business Pathway

Term 1 Term 2 Term 3

Introduction to

Business

Economics in an

International Context

Business & the

Business Environment

Skills for Study 1 Statistics for the

Social Sciences

Business Enterprise

Language for Study

1

Skills for Study 2 Skills for Study 3

Language for Study 2 Language for Study 3

Law and Social Sciences Pathway

Term 1 Term 2 Term 3

Introduction to

Social Science

Economics in an

International Context

The Individual, State

and Society

Skills for Study 1 Statistics for the

Social Sciences

Introduction to Legal

Principles and

Systems

Language for Study

1

Skills for Study 2 Skills for Study 3

Language for Study 2 Language for Study 3

Figure 3. Programme structure and pathways at the College

25

Students with an IELTS overall score (or equivalent) below 5 take a one-term pre-

sessional course in English language (12 weeks), after which they can start on the

BLSS programme.

All subject and skills courses are credit-bearing and are included in the Final

Academic Score. In order to complete the programme, the students need a Final

Academic Score of 40% or higher. However, in order to progress to the host

university, the students need to achieve a Final Academic Score of 60% or higher. In

addition, students must pass an exit proficiency test with an overall score of 65% or

higher with minimum 55% in each component – reading, writing, listening and

speaking.

The Language for Study modules are non-credit bearing and therefore not included in

the Final Academic Score. Language for Study 1 is taken by three-term students,

whose English language proficiency is generally lower, with an IELTS entry test score

(or equivalent) of 5. Language for Study 2 and 3 are taken by all 3-term and 2-term

students with IELTS scores (or equivalent) of 5.5 to 6.5. Students whose IELTS entry

score (or equivalent) is 6.5 or higher are exempt from language classes and do not take

the exit proficiency test.

The Skills for Study and Language for Study modules have different objectives. The

skills classes, as their name suggests, aim to equip students with the academic skills

required for university study, ranging from time management and research skills to

applying the process approach when writing extended essays and reports, and when

preparing for presentations and group discussions. Skills for Study 1, in particular, is a

foundation course during which students are introduced to academic conventions such

as the use of academic sources, citing and referencing, the structure and organisation

of academic essays and presentations. Through tutor and peer feedback, they learn

how to improve their work by analysing the assignment question, maintaining the

focus of their essay or presentation, and providing relevant arguments and data in

support of their thesis statement. On the other hand, Language for Study 1 aims to

introduce students to the grammar, vocabulary and functional language that they need

in order to be able to complete the subject module assignments, for example, the use

of the infinitive to express purpose, or signposting words and phrases that can be used

in a presentation.

26

2.3 Research questions

a) In the college setting, to what extent can IELTS be considered an accurate

predictor of the academic achievement of international students as measured by

their Final Academic Score?

b) In the college setting, to what extent can the Language for Study 1 exam be

considered an accurate predictor of the academic achievement of international

students as measured by their Final Academic Score?

c) In the college setting, to what extent can the Skills for Study 1 assessments be



27

3. METHODOLOGY

3.1 Participants

In total 570 international students from over 40 different countries completed the

Foundation Certificate in Business, Law and Social Sciences 2-term and 3-term

programmes at the College between January 2012 and August 2013.

3.2 Instruments

The following instruments are used to measure the students‟ performance: (1) IELTS

Overall, Listening, Reading, Speaking, and Writing scores; (2) Language for Study 1

(LS1) Raw scores, Reported Scores, Listening & Speaking sub-test, and Reading &

Writing sub-test scores; (3) Skills for Study 1 (SS1) Total, Essay, and Presentation

scores; (4) Final Academic Score – the average of the eight subject and skills modules;

and (5) Final Academic Score minus Skills for Study 1 score (FAS-SS1).

3.2.1 International English Language Testing System (IETLS)

On enrolment students are required to present evidence of English language

proficiency. IELTS is by far the most popular certificate submitted by students at the

College. The IELTS test consists of four sub-tests: Academic Reading (60 minutes, 40

questions), Academic Writing (60 minutes, 2 tasks), Listening (approximately 40

minutes, 40 questions), and Speaking (up to 14 minutes, 3 parts). Marks are awarded

in half bands from 0-9. Although no data is available, typically students take the

IELTS test up to six months before arriving at the College.

3.2.2 Language for Study 1 (LS1)

The LS1exam is taken at the end of term 1 by 3-term students only. It is an example of

weak second language performance assessment (McNamara, 1996) and consists of

two sub-tests (see Appendices I and II for samples):

a) Essay – an in-class, timed, integrated reading-and-writing task. Students are

allowed 60 minutes to read 3 short extracts and use them as sources for a 250-

word essay. They must paraphrase and cite the sources, although citation skills

are not assessed. The marking criteria include: Task Achievement, Range of

28

Language, Accuracy of Language, and Cohesion and Coherence measured on a

0-100 scale.

b) Oral exam – an in-class, timed, integrated listening-and-speaking task.

Students listen to two short extracts, take notes, and summarise the speakers‟

opinion on two questions. After that, they answer up to 5 additional questions.

Assessment criteria – Task Achievement, Fluency and Coherence, Language

Range and Accuracy, and Pronunciation measured on a 0-100 scale. The exam

lasts approximately 10 minutes.

3.2.3 Skills for Study 1 (SS1)

The SS1 assessments are submitted at the end of term 1. The assessment procedure is

an example of strong second language performance assessment (McNamara, 1996). It

consists of two tasks (see Appendices III and IV for samples):

a) Essay – out-of-class, untimed. Students are given an essay question

accompanied by instructions. They are expected to analyse the essay question,

identify appropriate sources by using the online catalogue of the host

university library, skim, scan, read for detail, take notes, and write at least two

drafts. They must paraphrase, cite and reference correctly by using the APA

referencing style. The final draft of 750 words is submitted to Academic

Services and an electronic copy is uploaded to TurnItIn to be checked for

plagiarism. The assessment criteria include: Content, Structure, Support, and

Clarity of Expression measured on a 0-100 scale. There are penalties for

plagiarism (the final mark of the re-submitted essay is capped at 75% of the

awarded mark) and for late submission (see Table 1). In addition, tutors can

deduct up to 5 points for failure to follow assignment guidelines on

presentation and formatting.

b) Paired Presentation – out-of-class, untimed research and preparation, but

delivered in class within a 10-minute time limit. The topic is the same as the

essay. The students must use PowerPoint, cite and reference correctly. The

assessment criteria include: Content, Structure, Support, and Clarity of

Expression and Delivery measured on a 0-100 scale.

29

Table 1. Penalties for late submission.

Number of working days late Penalty Awarded

1 85% of original mark



More than 3 Zero mark awarded

3.2.4 Final Academic Score

An important aspect of both concurrent and predictive validity is that criterion

measure should itself be valid (Salvia & Ysseldyke, 2001, p. 153). Hartnett and

Willingham (1980) examine the pros and cons of a range of measures of academic

success and comment that overall GPAs „seem to represent a good composite of

whatever kinds of academic performance are reflected in grades‟. Therefore, in this

study the Final Academic Score that the students are awarded at the end of their 2- or

3-term programme will be used as the criterion measure.

The Final Score is the average of the eight skills and subject modules. The students

need 40% to complete the programme, but 60% to be able to progress to the host

university. The subject module assessments comprise a mixture of 1500-to-2500-word

essays, reports and business plans, and exams consisting of 10-20 MCQs plus 2-3

questions on case studies for which the students write 1-to-2-page answers. The

assessment criteria include Research, Content and Argument, Structure, Coherence

and Clarity, Presentation and Referencing. In the exam answers subject tutors look for

factually correct content, analysis, critical discussion, and relevant support. Language

is not part of the assessment criteria and grammar and vocabulary mistakes only affect

the final mark if meaning is obscured.

30

3.2.5 Final Academic Score minus Skills for Study 1

Since Skills for Study 1 is part of the Final Academic Score, a Final Score minus SS1

was calculated to ensure that the two variables are independent of each other.

3.3 Data Collection and Procedures

3.3.1 Data collection

Electronic copies of the students‟ scores were obtained from the Academic Services

office. As the record-keeping system changed in 2012, hard copies were obtained for

the Language for Study 1 and Skills for Study 1 scores recorded on the old system,

and were added to the spreadsheets.

The Final Academic Score (FAS) is available for all 570 students. The English

Language Entry scores consist of 452 IELTS overall scores and 449 sub-scores. This

is because 118 students either took other exams such as TOEFL iBT, PTE Academic,

IGCSE, or their scores are missing. As some students were exempted from Language

for Study 1 classes, only 279 LS1 total scores are available. However, 65 of those

students took the new LS1 exam introduced in January 2013 and were therefore

excluded from the analysis. The LS1 scores are rounded up or down (see Table 2)

following the same convention as outlined in the IELTS Quality and Fairness

Brochure (IELTS, 2012) and only the rounded scores are reported to the students. In

this study the total LS1 Reported scores are used.

Table 2. Rounding of LS1 raw average scores

Raw

score

Reported

score

Raw

score

≥ 37.5 40 < 42.5

≥ 42.5 45 < 47.5

≥ 47.5 50 < 52.5

≥ 52.5 55 < 57.5

≥ 57.5 60 < 62.5

≥ 62.5 65 < 67.5

≥ 67.5 70 < 72.5

31

For some students the LS1 total scores in the spreadsheets were wrong. Instead of

calculating the average score of the Listening/Speaking and Reading/Writing, the sub-

tests were weighted 40% and 60% respectively. However, this resulted in only 3

students being misclassified (e.g. awarded 60 instead of 55), who were therefore

excluded from further calculations. In addition, another student, who received 0 for the

Listening/Speaking sub-test, most likely because of absence, was excluded. Overall,

210 LS1 total scores, Listening/Speaking and Reading/Writing component scores are

available. Out of 570 students, 569 SS1 total scores are available (1 student received 0

for their essay, but since it is not clear whether they resubmitted, their score was

excluded). Out of these 569 scores, 563 consist of a total score plus essay and

presentation sub-scores.

The two correlation coefficients most commonly used in correlation studies are the

Spearman rank order correlation coefficient and the Pearson product-moment

correlation coefficient. While both coefficients show the strength and direction of the

relationship between two variables, they are underpinned by different assumptions

regarding the normality of distribution. The Pearson product-moment correlation

coefficient (r) is a parametric statistic, which assumes that the data is normally

distributed. On the other hand, the Spearman rank order correlation coefficient ( ) is a

non-parametric statistic, which does not assume normal distribution of the data

(Bachman, 2004, pp. 89-95). Another difference is that the Pearson (r) assumes that

the two variables are continuous, while Spearman ( ) assumes that the variables are

rank orders, for example ratings of speaking and writing skills (Green, 2013, p.71).

Finally, both the Pearson (r) and Spearman ( ) work on the assumption that that data

is linear, i.e. on a scatterplot it will appear as a more or less straight line, and

independent, i.e. each score „must have been generated in ways which could not have

influenced the generation of the other‟ (Brown & Rodgers, 2002, p. 182). Although

the Pearson product-moment and the Spearman rank correlation coefficients may

produce different results if applied to the same data set, their interpretation is basically

the same – they show the magnitude and direction of the relationship between the

variables in question (Bachman, 2004, p.91). Cohen (1998, cited in Pallant, 2007,

p.132) suggests the following interpretation of the strength of the relationship between

two variables:

32

Small r=.10 to .29

Medium r=.30 to .49

Large r=.50 to 1.0

3.3.2 Procedures

Data analysis was performed using SPSS Version 19. Box plots were generated to

identify outliers and extreme cases. SPSS defines outliers as points that „extend more

than 1.5 box-lengths from the edge of the box‟, and extreme cases as points that

„extend more than three box-lengths from the edge of the box‟ (Pallant, 2007, p. 63).

The outliers are represented by a small circle and extreme cases by an asterisk. There

is no clear-cut answer as to whether to delete or include the outliers in the data

analysis. Bachman (2004, p. 102) argues that deleting outliers is only justified if there

are reasonable grounds to assume that the scores do not accurately represent the

abilities measured by the test. He recommends „going back through the data

processing trail‟ in order to check if the scores are genuine and if administrative

records can provide any explanation for extremely high or low scores. For example, in

task-based EAP assessments low scores may be due to penalties incurred for

plagiarism, late submission or no submission. Furthermore, Larson-Hall (2010, p. 91)

points out that outliers are problematic for parametric statistical procedures as they

assume normal distribution of the data, and therefore by deleting them we could be

disregarding the fact that the distribution may be non-normal. Pallant (2007, p.129)

recommends checking output from Explore for the values of the mean and the

trimmed mean, i.e. the mean with the top and bottom 5% removed. If the values are

very similar, the outliers may not be cause for concern.

Scatterplots for each data set were generated using SPSS and checked for linearity. In

addition, the descriptive statistics and histograms were examined for normality of the

distribution and for homogeneity of variance, which are assumptions for parametric

tests (Pallant, 2007, p. 204). Normal Q-Q plots and Detrended normal Q-Q plots were

also inspected. Bachman and Kunnan (2005, pp. 45-46) explain that in the normal Q-Q

plot reasonably normally distributed scores should be closer to the diagonal line,

which represents what the values would be if the distribution were normal. Also, in the

Detrended normal Q-Q plots for a normal distribution the scores should cluster near

the horizontal line. Finally, the Kolmogorov-Smirnov statistic was checked as a non-

33

significant result (Sig. > .05) indicates normality, while a significant result means that

the assumption of normality has been violated (Pallant, 2007, p.62).

Correlation coefficients for each of the three sets of scores (including sub-scores) were

calculated: IELTS and the Final Academic Score; LS1 and the Final Academic Score;

SS1 and the Final Academic Score minus SS1.

Finally, histograms showing the distribution of the Final Academic Scores by IELTS

band and LS1 band were produced and the results summarised in tables for easy

reference. However, as the SS1 scores are not reported in whole or half bands, there

are 106 separate S1 scores, which makes the procedure impracticable for reporting the

distribution of the Final minus SS1 scores by SS1 score.

34

4. RESULTS

The following section presents the results of the data analysis. For each of the three

sets of variables, the findings regarding the assumption of normality are described first

as they determine the application of parametric or non-parametric measures. After

that, the obtained correlation coefficients are presented.

4.1 IELTS & Final Academic Score

4.1.1 Box plots – outliers and extreme cases

Both box plots (see Appendix V) show a small number of outliers and the IELTS

Overall Score box plot contains several extreme cases. However, there is some

evidence that the scores accurately reflect the ability of the students so there is no

justification for excluding them from further analysis. First, the IELTS mean score is

5.37, but the range is 5 (from 3.5 to 8.5). The majority of students studying at the

College do not meet the host university English language entry requirements at the

start of their programme. However, some students, particularly from Nigeria and

Singapore, have high levels of English language proficiency, but for one reason or

another do not meet the university academic requirements. The fact that IELTS has

nine levels of competence – from non-user to independent user (IELTS, 2012) –

means that in a pathway college it is not unusual to come across large differences in

language proficiency. For this reason, there are 2- and 3-term courses. Moreover,

IELTS has strict quality control procedures in place that ensure that the results are

sufficiently reliable (IELTS, 2012). Therefore, the outliers and extreme cases will not

be excluded from analysis.

4.1.2 Scatterplots

Examination of the scatterplot in Figure 4 shows that the relationship between the two

variables is linear.

35

Figure 4. Scatterplot of IELTS and Final Academic Scores

4.1.3 Descriptive statistics

Table 3. Descriptive statistics of IELTS and Final Academic Scores

Statistic IELTS FINAL

N 452 452

Mean 5.37 64.65

Std. Error of Mean .038 .317

Median 5.00 65.00

Mode 5.00 70.00

Std. Deviation .80 6.75

Variance .64 45.55

Skewness 1.238 -.615

.115 .115

Skewness / 10.77 –5.35

Kurtosis 1.928 1.219

.229 .229

Kurtosis / 6.45 4.08

Range 5.0 49.17

Minimum 3.5 31.00

Maximum 8.5 80.17

Table 3 shows that The IELTS and FAS skewness and kurtosis divided by their

respective standard errors are outside the – 2 +2 range and therefore violate the

assumption of normality. The histograms (see Appendix VI) also show that IELTS

36

scores are positively skewed, while the FAS scores are negatively skewed. The

distribution of both sets of scores is leptokurtic.

4.1.4 Q-Q plots and test of normality

The Normal Q-Q plots and Detrended normal Q-Q plots generated for both IELTS and

the Final Academic Score show deviations from the expected values if the distribution

was normal (see Appendix VII).

A Kolmogorov-Smirnov test returned Sig. value .000 for IELTS overall and all

subtests, and .010 for the Final Academic Score. A non-significant result (Sig. > .05)

indicates normality (Pallant, 2007, p.62). Therefore, the distribution of IELTS and

FAS scores is not normal.

The analysis presented above indicates that a non-parametric test will be more

appropriate, therefore Spearman rank order correlations ( ) were calculated for IELTS

(including the Reading, Writing, Speaking and Listening sub-scores) and the Final

Academic Score.

4.1.5 Non-parametric correlations

Table 4. Non-parametric correlations (Spearman ) between IELTS Overall, Reading,

Writing, Speaking, Listening and the Final Academic Score

Overall Reading Writing Speaking Listening

Reading .740**

-

(n=449)

Writing .739**

.496**

-

(n=449) (n=449)

Speaking .797**

.475**

.511**

-

(n=449) (n=449) (n=449)

Listening .855**

.605**

.561**

.638**

-

(n=449) (n=449) (n=449) (n=449)

Final Academic

Score

.467**

.397**

.401**

.313**

.453**

(n=452) (n=449) (n=449) (n=449) (n=449)

** Correlation is significant at the 0.01 level (2-tailed).

37

Significant medium positive correlations were found between the Final Academic

score and the IELTS overall score ( =.467, n=452, p< .01), Reading ( =.397, n=449,

p< .01), Writing ( =.401, n=449, p< .01), Speaking ( =.313, n=449, p< .01), and

Listening ( =.453, n=449, p< .01).

Since the students need to achieve a Final Academic Score of 60 and above to

progress to the host university, the percentage of students who obtained such a score is

presented in Table 5 by IELTS band.

Table 5. Percentage of students achieving a Final Academic Score of 60 and above by

IELTS band

IELTS

Band

N=452 Minimum

Final

Maximum

Final

Range

Final

% achieving

60+

3.5 1 55.99 55.99 0 0%

4.0 11 44.94 67.00 22.06 64%

4.5 63 50.86 70.00 19.14 52%

5.0 175 42.33 80.17 37.83 77%

5.5 96 31.00 75.00 44.00 91%

6.0 50 56.00 76.00 20.00 96%

6.5 28 54.21 75.06 20.85 93%

7.0 8 54.74 75.00 20.26 87.5%

7.5 11 64.00 80.00 16.00 100%

8.0 8 69.39 80.00 10.61 100%

8.5 1 75.69 75.69 0 100%

Table 5 shows that although higher language proficiency (as measured by IELTS)

generally results in a higher percentage of students achieving a final score of 60 and

above, there is still a significant proportion of students with an IELTS score below 5.5

who complete their academic programme with a Final Academic score of 60+. For

instance, 77% of the students who start their programme with an overall IELTS score

of 5.0 progress to the host university. The percentage for students arriving with IELTS

4.5 and 4.0 is 52% and 64% respectively.

38

4.2 LS1 & Final Academic Score


The box plot for LS1 Reported shows no outliers or extreme cases, but the box plot for

Final Academic Score (FAS) shows two outliers and one extreme case – 155 (see

Appendix VIII). Following the „data processing trail‟ revealed that student 155

dropped out of college before the resubmission date after failing some of the modules,

and was therefore excluded from further analysis.

4.2.2 Scatterplots



Figure 5. Scatterplot of LS1 Reported and Final Academic Scores

39


Table 6. Descriptive statistics of LS1 Reported and Final Academic Scores

Statistic LS1 Reported FINAL LS1L&S LS1&W

N 209 209 209 209

Mean 55.05 62.50 55.19 54.82

Std. Error of Mean .343 .416 .395 .330

Median 55.00 63.00 54.50 54.00

Mode 55.00 64.00a 51.00

a 53.00

Std. Deviation 4.95 6.01 5.71 4.77

Variance 24.517 36.124 32.617 22.713

Skewness .581 -.253 .758 .404

.168 .168 .168 .168

Skewness / 3.08 –1.51 4.51 2.40

Kurtosis .452 -.230 1.624 .302

.335 .335 .335 .335

Kurtosis / 1.35 –0.69 4.85 0.90

Range 25.00 30.38 37.50 26.80

Minimum 45.00 44.68 41.50 42.00

Maximum 70.00 75.06 79.00 68.80

a. Multiple modes exist. The smallest value is shown

Table 6 shows that FAS skewness and kurtosis divided by their respective standard

errors are within the –2 +2 range and therefore the scores are normally distributed.

LS1 Reported kurtosis divided by its standard error is within the –2 +2 range.

However, LS1 Reported skewness divided by its standard error is not within the –2 +2

range. Furthermore, both LS1 L&S kurtosis divided by its standard error and LS1

L&S skewness divided by its standard error are not within the –2 +2. LS1 R&W

kurtosis divided by its standard error is within the –2 +2 range. However, LS1 R&W

skewness divided by its standard error is not within the –2 +2.

Therefore, the assumption of normality is violated for some of the variables. Visual

inspection of Table 6 also shows that there is no homogeneity of variance.

40

4.2.4 Q-Q plots

The Normal Q-Q plots and Detrended normal Q-Q plots generated for LS1 Reported,

LS1 R&W, and LS1 L&S show deviations from the expected values if the distribution

was normal (see Appendix IX).


appropriate, therefore Spearman rank order correlations ( ) were calculated for LS1

Reported (including the Reading/Writing, Speaking/Listening sub-scores) and the

Final Academic Score.


Table 7. Non-parametric correlations (Spearman ) between LS1 Reported, Reading

& Writing sub-test, Speaking & Listening sub-test and the Final Academic Score

N=209 LS1 L&S LS1 R&W LS1 Reported FINAL

LS1 R&W .579**

-

LS1 Reported .839**

.823**

-

FINAL .342**

.501**

.479**

-

** Correlation is significant at the 0.01 level (2-tailed).

Significant medium positive correlations were found between the Final Academic

score and the LS1 Reported score ( =.479, n=209, p< .01), and between the Final

Academic score and the LS1 Listening & Speaking sub-test score ( =.342, n=209, p<

.01). A significant large positive correlation was found between the Final Academic

score and the LS1 Reading and Writing sub-test score ( =.501, n=209, p< .01).

41

Table 8. Percentage of students achieving a Final Academic Score of 60 and above by

LS1 Reported score

LS1

Reported

N=209 Minimum

Final

Maximum

Final

Range

Final

% achieving

60+

45.00 6 54.05 63.00 8.95 17%

50.00 61 44.68 68.13 23.45 56%

55.00 85 47.00 74.00 27.00 67%

60.00 43 50.19 75.06 24.87 93%

65.00 10 63.30 75.00 11.70 100%

70.00 4 54.21 74.00 19.79 75%

Table 8 shows that although higher language proficiency (as measured by LS1 overall

reported score) generally results in a higher percentage of students achieving a final

score of 60 and above, there is still a sizable proportion of students with scores of 50

to 55, who complete their academic programme with a score of 60+, which allows

them to progress to the host university.

4.3 SS1 and Final minus SS1

As the students‟ Final Academic Score includes the Skills for Study 1 score, which

may affect the correlation coefficient, a new FAS minus SS1 score was calculated.

However, since significant large positive correlations were found between the Final

Academic scores and the Final minus SS1 scores ( =.996, n=568, p< .01) with the

correlation coefficient very close to 1, we can be certain that „all that may be known

about a person with regard to the one variable is perfectly revealed by the other

variable‟ (Henning, 1987, p. 59), or in other words FAS and FAS minus SS1 measure

essentially the same construct. Therefore, the Final minus SS1 score was used in

assessing the predictive validly of the SS1 scores.


Both box plots (see Appendix X) show a small number of outliers and one extreme

case (432) among the Final minus SS1 scores. Following the „data processing trail‟

revealed that 432 was the same student who dropped out of college before the

42

resubmission date after failing some of the modules, and was therefore excluded from

further analysis. Individuals who receive low scores in SS1 or the subject modules

usually do so for one of three reasons: (a) they committed plagiarism and their final

mark was capped at 75% after resubmission, i.e. a student who would have obtained a

score of 60, actually received 45; (b) they submitted their assignment late and as a

result had their score reduced by 15%-25% in line with the regulations mentioned in

3.2.3; (c) they submitted work which was of poor quality. On the other hand,

individuals with high scores typically submit work that is above average. Since SS1 is

a performance assessment in the „strong‟ sense, its criteria are similar to ones applied

by the subject specialists at the College, which include penalties for plagiarism and

poor time management skills. Thus, it can be said that the results accurately reflect the

abilities of the students to complete the tasks and therefore the outliers should not be

excluded from further analysis.

4.3.2 Scatterplot



Figure 6. Scatterplot of SS1 and Final minis SS1 scores

43


Table 9. Descriptive statistics of SS1 and Final minus SS1

SS1 SS1E SS1P F-SS1

N 568 562 562 568

Mean 61.57 61.30 61.86 65.55

Std. Error of Mean .226 .262 .228 .298

Median 62.00 61.50 62.00 66.14

Mode 62.00 60.00a 62.00 71.14

a

Std. Deviation 5.38 6.22 5.40 7.10

Variance 28.896 38.630 29.155 50.440

Skewness -.232 -.354 -.157 -.469

.103 .103 .103 .103

Skewness / -2.252 -3.44 -1.52 -4.553

Kurtosis 1.133 1.500 .544 .180

.205 .206 .206 .205

Kurtosis / 5.527 7.28 2.64 0.878

Range 39.00 44.50 37.00 41.52

Minimum 40.00 35.50 42.00 40.24

Maximum 79.00 80.00 79.00 81.76

a. Multiple modes exist. The smallest value is shown

Table 9 shows that SS1 skewness and kurtosis divided by their respective standard

errors are not within the –2 +2 range and therefore violate the assumption of

normality. Although the Final minus SS1 kurtosis divided by its standard error is

within the –2 +2 range, the result for skewness shows that the Final minus SS1 scores

are negatively skewed. Moreover, there is no homogeneity of variance as the two

statistics are very different.

4.3.4 Q-Q plots and test of normality

The Normal Q-Q plots and Detrended normal Q-Q plots generated for both SS1 and

Final minus SS1show deviations from the expected values if the distribution was

normal (see Appendix XI).

A Kolmogorov-Smirnov test returned Sig. value .004 for F-SS1 and .001 for SS1. As a

non-significant result (Sig. > .05) indicates normality (Pallant, 2007, p.62), the

distribution of scores for SS1 and Final minus SS1 is not normal.

44


appropriate, therefore Spearman rank order correlations ( ) were calculated for SS1

(including the essay and presentation sub-scores) and Final minus SS1.


Table 10. Non-parametric correlations between SS1 Overall, SS1 Essay, SS1

Presentation, and Final minus SS1

SS1 SS1Essay SS1Presentation

SS1Essay .925**

(n=562)

SS1Presentation .823**

.579**

(n=562) (n=562)

Final Minus SS1 .593**

.529**

.529**

(n=568) (n=562) (n=562)

**. Correlation is significant at the 0.01 level (2-tailed).

Significant large positive correlations were found between the Final Minus SS1 score

and the SS1 Overall score ( =.593, n=568, p< .01), the SS1 Essay ( =.529, n=562,

p< .01), and the SS1 Presentation =.529, n=562, p< .01).

45

5. DISCUSSION OF RESULTS

5.1 Research question 1

In the college setting, to what extent can IELTS be considered an accurate

predictor of the academic achievement of international students as measured by

their Final Academic Score?

The results of this study confirm the findings of Woodrow (2006) and Yen and Kuzma

(2009), who also reported correlations between overall IELTS scores and 1st semester

GPA of r=.40 and r=.46 respectively. One possible reason for the medium positive

correlations may be the homogeneity of the sample. Studies that found no or little

evidence of the predicative power of IELTS (Cotton & Conrow, 1998; Dooey &

Oliver, 2002) focused on heterogeneous populations. For example, Cotton &

Conrow‟s (1998) participants came from the schools of Science and Technology,

Architecture, Engineering, Health science, Humanities, Social Sciences, Visual and

Performing Arts, Commerce and Law. Dooey and Oliver (2006) recruited participants

from three disciplines – business, science and engineering. However, different

disciplines most likely require different levels of English language proficiency. Not

surprisingly, the IELTS test score guidelines divide university courses into two

categories: linguistically demanding and linguistically less demanding, and suggest

different IELTS entry scores for each (IELTS, 2011, p.13). Indeed, Cotton and

Conrow (1998) argue that the fact that 45% of the students in their study attended

linguistically less demanding courses such as science and engineering weakened „any

link between language proficiency measures and academic outcomes‟. For such

students, mathematical ability, for example, may be more important for academic

success than language skills. Moreover, when selecting the criterion measure it should

be borne in mind that final grades may show substantial variations within and across

disciplines (Hartnett & Willingham, 1980). The present study circumvents these

limitations because, first, all the participants were foundation certificate students

enrolled on the Business, Law and Social Sciences programme, and, second, as many

of the subject tutors at the college teach on more than one course there is bound to be

more agreement among tutors about what constitutes academic success and greater

comparability of the assessment criteria. Therefore, it seems reasonable to assume that

46

the medium positive correlations reported above may partially be due to the relative

homogeneity of the population.

Another contributing factor may be the wider range of IELTS scores in the present

study – from 3.5 to 8.5. A truncated sample may show weak correlations and lead to

incorrect conclusions about the relationship between language proficiency and

academic achievement (Henning, 1987, p.66; Bachman, 2004, p.96). Indeed, many

predictive validity studies emphasise the effect of range restriction on their results

(Cotton & Conrow, 1998; Dooey & Oliver, 2002; Woodrow, 2006; Cho & Bridgeman,

2012). This is not surprising as universities only enrol students who have met their

minimum English language entry requirements, commonly set at IELTS 5.5 to 7 for

undergraduates at UK universities (Hayatt & Brooks, 2009, p14). An advantage of the

current study is that the students attended a pathway programme before starting

university and therefore were accepted even if their IELTS score was lower than 5.5.

In fact, 250 out of the 452 students, or 55%, had an IELTS score of 5.0 or less on

entry. As mentioned in section 2.2, such students are required to take a 12-week pre-

sessional English language course and it could be argued that their level of English at

the start of their academic programme was higher than when they took their IELTS

test. However, two studies cast some doubt about the likelihood of quick score gain as

a result of intensive study. First, Elder and O‟Loughlin (2003) investigated whether

attending a 10-12-week intensive English course had an effect on the IELTS scores of

112 international students in Australia and New Zealand. The found that the maximum

score gain was 2 bands and the minimum -1, with an „average overall gain […]

slightly more than half a band‟. The second study by O‟Loughlin and Arkoudis (2009)

examined score gain over a longer period. They compared the IELTS scores of 30

undergraduate and 33 postgraduate students before and after graduating from an

Australian university. The results reveal that the highest average increase in scores

was in Reading (0.532), followed by Listening (0.500), Speaking (0.444) and Writing

(0.206). This suggests that the language proficiency of the students in the present

study was unlikely to have been dramatically affected by their attending a one-term

pre-sessional language course. Although we need to bear in mind that lower level

students demonstrate greater progress in language learning than their more proficient

counterparts within the same time period (O‟Loughlin & Akkoudis, 2009), it is still

47

the case that in the present study there is less range restriction compared with the

majority of the predictive validity studies discussed in the literature review.

Despite the fact that there were significant medium positive correlations between the

Final Academic score and the IELTS overall and subtest scores, these are not accurate

enough to allow for precise group predictions. As can be seen from Table 5 in 4.1.5, a

substantial percentage of students who started their programme with an IELTS score

between 4.0 and 5.5 managed to achieve a Final Academic score of 60% and above,

which allowed them to progress to the host university. To account for this we need to

examine the view put forward in the literature review that IELTS is a timed, product-

oriented, weak performance assessment, which simulates real-life tasks with the aim

of sampling a candidate‟s language ability and whose construct definition is

ability/trait focused.

First, there are some major differences between the IELTS speaking construct and the

oral assessments at the College. While the IELTS speaking tasks attempt to simulate

real-life situations that students may encounter at university, such as giving personal

information, making a presentation and stating an opinion, all of this is done within a

14-minute period (IELTS, 2011). The candidates are expected to use their general

knowledge and discuss topics that have been carefully selected to avoid bias against

any group of candidates (IELTS, 2012). The marking criteria are predominantly

linguistic: fluency and coherence, lexical resource, grammatical range and accuracy,

and pronunciation (IELTS, 2013). The oral presentations in the college assessments,

on the other hand, emphasise research skills, use of academic sources, preparation and

practice. Moreover, the assessment criteria include non-linguistic factors such as

appropriate use of gestures, eye contact, tone of voice and body language, in addition

to content, logical structure and support from sources. When Yen and Kuzma (2009)

found a small negative correlation between Speaking and 1st semester GPA, they

suggested as a possible reason that the GPA scores were primarily based on written

assignments. In the present study, oral assessments also form a small part of the Final

Academic Score; in fact, there are only three oral assessments – one presentation for

Business Enterprise, a group discussion for Skills for Study 2, and an oral presentation

for Skills for Study 3. Therefore, the different abilities measured by IELTS Speaking

and the oral assessments at the College and the relatively small number of the latter

48

might explain why the correlation between the Final Academic score and the IELTS

Speaking sub-scores ( =.313) is at the lower end of the medium correlation range.

Furthermore, the IELTS writing construct has also been criticised as not being

reflective of academic writing at university. A comparison of the marking criteria

shows that while IELTS scripts receive a grade for task achievement, coherence and

cohesion, lexical resource, and grammatical range and accuracy, subject module

criteria at the College include research, content and argument, structure, coherence and

clarity, and presentation and referencing. Moore and Morton (2007, p.228) conclude

that despite some similarities between academic writing and the IELTS essay, the

latter, with its emphasis on personal opinion, is „suggestive of such public, non-

academic genres as the letter to the editor or the newspaper editorial‟. Leki (2011, p.

103) even argues that English language university screening tests like IELTS have had

a negative washback leading to the creation of „the monster of the English L2 essay

exam genre‟, which reduces writing skills to learning certain formulas and structures

in the hope that slotting them neatly together would guarantee success. The medium

correlation coefficient ( =.401) that emerged from the present study may be seen as

providing further evidence that the writing construct operationalized by IELTS is quite

district from the construct of academic writing at university. The findings also appear

to support Leki‟s (2011, p.106) view that once students are introduced to the

conventions and expectations of university writing, they have „no trouble setting aside

the essay exam as, at best, marginal to their new writing tasks‟.

Finally, IELTS is timed and product-oriented. Candidates are given 20 minutes to

complete Writing task 1 and 40 minutes for Writing task 2. Although they are

encouraged to spend a few minutes planning their response and a few minutes

checking it at the end, it is clear that there is no time to produce more than one draft.

This contrasts with the process approach adopted by the course tutors at the College,

where students work on their assignments for most of the term, hand in detailed plans

or first drafts, and receive feedback before formally submitting their final draft for

marking.

As far as reading skills are concerned, the medium correlation ( =.397) between the

IELTS Reading sub-scores and the Final Academic Score may be due to that fact that

49

in an IELTS exam the candidates are restricted not only by the time available but also

to reading three pre-selected texts, while in academic settings they have access to a

range of sources and even if they fail to understand some points in the text, they may

still be able to identify and use appropriate arguments in their written assignments.

Among the four skills assessed by IELTS Listening correlates the highest with the

Final Academic Score ( =.453) in the present study. This finding is somewhat

surprising, although Yen and Kuzma (2009) and Woodrow (2006) found similar

correlations of r=.45 and r=.35 respectively between IELTS Listening and 1st semester

GPA. Obviously, listening is an important skill in lectures and seminars, so although it

is not specifically assessed in the subject modules, it must contribute to the students‟

ability to engage with the course content.

Commenting on the meaningfulness of correlation coefficients, Hughes (2003, p.32)

notes that „if the coefficients between scores on the same construct are consistently

higher than those between scores on different constructs, then we have evidence that

we are indeed measuring separate and identifiable constructs‟. This raises the question

of whether the abilities measured by IELTS can form the basis for making inferences

about the students‟ ability to complete the academic tasks required of them by subject

tutors.


In the college setting, to what extent can the Language for Study 1 exam be



The Language for Study 1 exam, like IELTS, is a weak performance test whose

construct is trait/ability focused. However, there is one major difference – the LS1 test

contains integrated Reading/Writing and Listening/Speaking tasks. In this respect, it

implements one of the recommendations put forward by Moore and Morton (2007, p.

235), namely the inclusion of source texts, which the test-takers read in order to

identify relevant information which they must then incorporate in their response.

Moore and Morton (2007) argue that such a change to the current IELTS writing task

50

will increase its authenticity as it is more representative of university writing. The

results from the present study seem to support this view as the correlation coefficient

between the Final Academic scores and the LS1 Reading & Writing sub-test scores

( =.501, n=209, p<.01) is higher than the correlation coefficient between the Final

Academic scores and the IELTS Writing scores ( =.401, n=449, p<.01). Although the

reading-to-write task assessment criteria are almost identical to the ones used to grade

the IELTS essay task – Task achievement, Range of Language, Accuracy of

Language, Cohesion and coherence – the Task achievement criterion includes the

extent to which the arguments presented in the essay are supported by incorporating

information from at least two of the excerpts provided. Students are assessed on their

ability to paraphrase and penalised if they lift phrases from the source texts. Therefore,

although the LS1 Reading & Writing task can be considered an instance of „role-

playing‟ (McNamara, 1996), the context provided by task is realistic and simulates

some aspects of academic writing at university such as summarising, paraphrasing and

using in-text citations.

On the other hand, the correlation between the Final Academic score and the LS1

Listening & Speaking task score, although significant, is at the lower end of the

medium correlations ( =.342, n=209, p<.01), which is comparable to the correlation

between the Final Academic score and the IELTS Speaking score ( =.313, n=449, p<

.01). The fact that these two correlations are the lowest in the present study suggests

that the Listening/Speaking construct in LS1 and the Speaking construct in IELTS

may be under-representations of the traits and abilities that are important for

successful oral performance in academic contexts. Indeed, the assessment criteria for

the integrated Listening & Speaking tasks – Task Achievement, Fluency and

Coherence, Language Range and Accuracy, and Pronunciation – strongly resemble the

IELTS Speaking marking criteria. The only difference is the introduction of a Task

achievement criterion, which evaluates the extent to which examinees correctly

summarise the opinions of the speakers in the listening section. These criteria differ

significantly from the ones used to assess academic presentations, which include

Support, i.e. the appropriate use of academic sources and correct referencing, and

Clarity of delivery, i.e. suitable body language, gestures, speech rate and tone of voice.

51

The LS1 exam is also timed and product-oriented. Students are given 60 minutes to

complete the Reading & Writing task without sufficient time to revise and make

alterations, so the final product can only be viewed as a first draft (Weigle, 2002;

Hartill, 2000). The Listening & Speaking part lasts approximately 10 minutes. The

recording is heard once only, and the students have little time to organise or think

about their notes before they are asked to respond to the questions. In the interest of

fairness to all examinees, the tutors are not allowed to rephrase the questions.

Although this may enhance the reliability of the test, it hardly resembles what happens

in real-life communication. In a lecture, if students miss a particular point, they may

reasonably expect that it will be repeated in the summary at the end. And even if it is

not, they can rely on the lecture notes or the coursebook. In a conversation, the speaker

will most likely be willing to paraphrase and explain if their interlocutor does not

understand.

The LS1 exam is modelled on large-scale tests such as IELTS, which, according to

Weigle (2002, p.175) are mostly concerned with reliability and practicality, i.e.

providing reliable scores within a short period of time to a large number of test-takers.

The other aspects of test usefulness suggested by Bachman and Palmer (1996), namely

construct validity, authenticity, interactiveness, and impact, are not neglected, of

course, but Weigle (2002, p.175) argues that they are more likely to be prioritised by

achievement assessments as teachers are more concerned with achieving the course

objectives, preparing students for real-world tasks and ensuring they engage in the

process of writing.

Overall, scores on the LS1 exam, and especially the reading-to-write task, seem a

better predictor of academic success at the College than IELTS scores. However, it

should be borne in mind that the sample is twice as small (n=209) as the IELTS

sample (n=452), and it is truncated since all the students who take the course are

selected because their entry IELTS score is below 6.5. The correlation coefficient may

be different, if higher level students took the exam as well. The results also show a

number of false negatives, i.e. some students who obtained low marks still achieved a

score of 60+ after 1 or 2 more terms at the College. Therefore, it would be difficult to

base individual predictions on the scores of the LS1 exam only.

52


In the college setting, to what extent can the Skills for Study 1 assessments be



The positive correlations observed in the present study between the SS1 scores and the

Final minus SS1 scores are higher than the correlations reported by the predictive

validity studies cited in the literature review. It was noted earlier that comparisons

between the predictive validity of different EAP assessment instruments need to take

into account the way the construct is defined and operationalized in each test. A higher

correlation would imply a stronger similarity between the predictor and the criterion

variables.

The SS1 assessments tend towards the „strong‟ end of McNamara‟s (1996) scale of

second language performance assessments as their construct is task/context focused.

The main aim of the SS1 assessments is to answer the question: Are the students able

to write an academic essay and give an academic presentation in English? The

assessment criteria of SS1 are modelled on the criteria used by subject tutors when

evaluating student performance in academic contexts. Content, structure, support and

clarity of expression are each given equal weighting, with the first three criteria

reflecting the factors cited by Errey (2000) as the ones that influence subject tutors the

most. A comparison with the marking criteria for some of the BLSS modules reveals

that students are assessed on research, content and argument, structure, coherence, and

referencing. Moreover, for the oral presentation, non-linguistic factors such as body

language, eye-contact, and voice projection are important for successful completion of

task. Such criteria are combined with content, which subsumes research skills,

structure and support from academic sources to conform to the indigenous assessment

criteria of the academic community.

In addition, the SS1 assessments are in essence achievement tests based on course

outcomes, which comprise a range of skills that students need in order to succeed in

53

their academic studies – analysing assignment tasks, performing library searches,

identifying relevant sources, skimming, scanning, reading for detail, note-taking,

synthesising information, planning, writing drafts that adhere to specific academic

genre conventions, using sources, citing and referencing correctly. Therefore, it should

not be surprising that the SS1 scores correlate well with the final academic scores as

these skills are transferrable across academic disciplines. As Leki (2011, p.104) points

out students „are less likely to transfer knowledge to new situations perceived as

unlike the old ones‟, but if students understand that their subject tutors value the same

skills they are taught in their SS1 classes, such a transfer is more likely to occur.

Another factor that may contribute to the higher predictive validity of the SS1

assessments is that they are out-of-class and untimed. Hartill (2000) notes that the Law

Department at a UK university introduced take-home papers because they realised that

even non-native speakers of English who generally produced superior work to their

native-speaker counterparts often underperformed under exam conditions. Slaght and

Howell (2007, p. 255) also point out that grades from out-of-class assessments should

be part of the overall evaluation of student performance as timed tests may not always

be the best measure of a person‟s abilities. The SS1 assessments allow the students not

only enough time to research a topic thoroughly, but also the opportunity to revise and

improve the language and style of their essay. Undoubtedly, when working on out-of-

class assignments, students have access to dictionaries, thesauruses and translation

software, although there is no guarantee that these necessarily improve the quality of

lexis and grammar in student essays. For example, one study of dictionary use by

international students in a UK university found that students often experienced

problems with identifying the appropriate sense of the word they were looking up,

which sometimes resulted in a serious misconstruction of the text (Nesi, 2002).

Moreover, the product of translation software, regardless of whether it is used in

reading comprehension or as a writing tool, is by no means perfect, especially between

languages with „striking lexico-sematic and structural differences‟ (Niño, 2009).

Therefore, it would be an oversimplification to argue that access to reference books

and tools such as Google translate automatically improves the quality of student

writing.

54

In fact, a more important determinant of the better predictive validity of strong

performance assessments may be the process approach to writing utilised by the Skills

for Study 1 course. Students receive support from their tutors in the form of feedback

on an essay plan and a first draft before they submit their essay for grading. Although

some authors have questioned the validity of assessments completed with some

assistance from peers or tutors (Gearhart & Herman, 1998), Weigle (2002, p. 177)

argues that in the real world, not least in academic settings, the „social aspects of

writing‟ are just as important as the individual ability to produce texts. She even points

out that as an author she has benefited from the feedback and suggestions received

from colleagues and editors. Thus, it could be argued then the SS1 assessments are

more reflective of writing in the real world than a timed impromptu test.

Of course, the reliability of out-of-class assessments should not be neglected. Just as

with large-scale language tests, it is necessary to implement proper scoring procedures

(Weigle, 2002, p.182) so that all stakeholders, including the examinees and the host

university, can be assured that the assessments are a fair and reliable measure of

ability. In the College, this is achieved by means of termly standardisation meetings

and the double marking of 10-20% of the essay scripts and presentations. Another

threat to reliability is plagiarism. If students submit assignments that have been

partially or wholly plagiarised, the score awarded will not be a reflection of their

ability. Chen and Ku (2008, p.87) summarise the factors that contribute to plagiarism

on the part of international students: cultural attitudes, misunderstanding of what

constitutes plagiarism, lower English language proficiency, and inadequate

institutional policies. In the College, this problem is addressed by (a) in-class activities

aimed at raising awareness of plagiarism and ways to avoid it, and (b) the use of

TurnItIn, defined as an „electronic detection service‟ (Salmons, 2008, p. 210), to locate

passages from electronic sources that have been reproduced word-for-word.

One of the possible reasons why the correlation, although large at =.593, is not even

higher is that at the end of term one, the students are still adapting to the requirements

of academic study in the UK. After submitting their SS1 assignments, they have one or

two terms to further develop the skills required for academic success. Another factor

may be the extent to which the students master the content of their subject modules.

Although SS1 can be classified as strong performance assessment, the topic of the

55

essay and the presentation is not directly related to the students‟ area of study. This is

because (1) the assessments are designed for students on both the Business, Law and

Social Sciences, and Science and Engineering programmes, and (2) the tutors on the

course are not necessarily subject specialists in business or social sciences. Therefore,

it is conceivable that students may acquire the necessary research and academic

writing skills, but still perform unsatisfactorily in their subject modules due to

inadequate content knowledge.

Overall, the large positive correlation between SS1 and the Final minus SS1 scores is

to a certain extent due to the observation that as a strong performance assessment SS1

incorporates a range of the real-life skills required for academic success. These are

reflected both in the assessment setting and in the marking criteria. Out-of-class,

untimed assessments are the norm in higher education. Moreover, subject specialists

employ marking criteria that focus primarily on content, text structure, logical

argumentation, and use of academic sources, and in that sense SS1 represents an

authentic assessment of students‟ readiness for academic study.

6. CONCLUSIONS AND IMPLICATIONS FOR FUTURE RESEARCH

The present study investigated the predictive validity of weak and strong second

language performance assessments in a university pathway programme. Spearman

rank order correlations were calculated between the scores of two weak performance

assessments – IELTS and Language for Study 1 (LS1), which is an in-house end-of-

term language examination – and the Final Academic Score, which is a composite of

eight scores that students obtain at the end of their studies and which determines their

progression to the host university. In addition, Spearman rank order correlations were

calculated between the scores from one strong performance assessment – Skills for

Study 1 (SS1), which consists of an academic essay and presentation – and the Final

Academic Score minus SS1.

The findings appear to confirm the views expressed in the literature that assessments

based on the view of language proficiency as „pragmatic ascription‟ (Bachman, 1990,

p.254), i.e. someone is „able to do X‟ in the language attain a greater degree of success

56

when predicting real-life performance (Skehan, 1998). Significant large positive

correlations were found between the SS1 Overall score and the Final Minus SS1 score

( =.593, n=568, p< .01), while although significant the correlations between the Final

Academic score and the LS1 Reported score ( =.479, n=209, p< .01), and the Final

Academic score and the IELTS overall score ( =.467, n=452, p< .01) are smaller and

can be described as medium (Cohen, 1998, cited in Pallant, 2007, p.132).

It was argued that strong second language performance assessments are better

predictors of success in university pathway programmes as they are likely to simulate

the conditions of the target language use situation more accurately than weak

performance assessments, which typically bear only a superficial resemblance to real-

world academic tasks. The main features of strong second language performance

assessments that increase their predictive power are: the task/context approach to

construct definition, the use of indigenous marking criteria, and their untimed and

process-oriented nature. On the other hand, the characteristics of weak second

language performance assessments – the trait/ability focused approach to construct

definition, the use of primarily linguistic marking criteria, and their timed, product-

oriented nature – stem from the view of language proficiency as a „theoretical

construct‟ (Bachman, 1990, p.254), i.e. someone „has ability X‟. Such assessments are

more appropriate for determining the level of ability of individuals.

One interesting area for future research would be to establish the link between

„someone has ability Y‟ and „someone is able to do X‟. Bachman (1990, p. 253)

argues that an assessment that predicts future behaviour cannot be seen as an indicator

of ability. On the other hand, weak performance assessments that indicate ability are

often used as predictors of future behaviour. The present study demonstrates that this

may not be a valid use of such tests. Perhaps, in the interest of fairness, it is advisable

to place test-takers on both a language ability scale and a successful task completion

scale. It is conceivable, in my view, that people may exhibit different levels of, say,

reading comprehension in a timed test designed to measure individual language

ability, and in an untimed assessment which allows access to external resources. The

question for educational institutions will be: What is valued more – spontaneous

impromptu demonstration of receptive and productive skills, or unhurried and

57

rehearsed performance? If the answer is the former, only students approaching native-

like ability should be admitted to university to succeed or fail just like their native-

speaker counterparts. If the answer is the latter, then students who, through strong

performance assessments, show ability to adapt and develop should not be denied

access to higher education on the basis of weak second language performance

assessments.

Another point that merits consideration is the extent to which the findings of the

present study have any meaningful practical applications. For example, do they

enhance our ability to predict academic achievement? Unfortunately, according to

Cohen, Manion, and Morrison (2000, p.202) only correlations over 0.85 allow for

accurate individual or group predictions. The authors note, however, that such high

correlations are rare in education studies. In the present study the correlations are

within the 0.35 to 0.65 range and significant at the 0.01 level. Cohen, Manion, and

Morrison (2000, p.202) state that correlations in this range are „of little use for

individual prediction‟ and can be used to make only „crude group predictions‟.

However, they point out that their value may increase if used as part of multiple

regression analysis. As mentioned in the literature review, a number of factors have

been identified as contributing to academic success: motivation, learning strategies,

quantitative skills (Cho & Bridgeman, 2012), intellect and acculturation (Cope, 2011),

personal background, academic background, teaching and support (Feast, 2002).

Therefore, future research making use of multiple regression analysis may focus on

what combination of factors most successfully predicts academic achievement in the

College.

The study also has some limitations. For example, although the majority of the

assessments that comprise the Final Academic Score involve extensive writing, one

subject – Statistics for the Social Sciences – assesses completely different skills. This

study did not focus on the individual grades that count towards the Final Academic

Score, but there is some anecdotal evidence that Chinese students, in particular,

perform better in the Statistics module, which may boost their final score. As some

predictive validity studies of IELTS have shown, the subject courses that students

attend may have an effect on the predictive power of second language performance

assessments (Cotton & Conrow, 1995; Dooey & Oliver, 2002). Therefore, it may be

useful to explore the extent to which the inclusion of Statistics for the Social Sciences

58

in the Final Academic Score affects the strength of the correlation. Related to this

issue is the focus on Foundation Certificate of Business, Law and Social Sciences

programme in the present study. However, a certain proportion of the students in the

College study on the Foundation Certificate of Science and Engineering. In addition, a

number of students are enrolled in the Graduate Diploma programmes both in

Business, Law and Social Sciences, and Science and Engineering. All these students

take the same Language for Study and Skills for Study modules and therefore

complete the same assessments. It is important to ascertain the extent to which the LS1

and SS1 assessments predict the academic achievement as measured by the FAS for

these groups of students as well. It is not inconceivable that the assessments may be

valid predictors for one group, but not for another. Such a study may lead to changes

in the curriculum in order to ensure that the skills students acquire and are assessed on

are relevant.

Finally, a positive relationship between two variables X and Y „does not imply in any

way that X influences, affects, or causes Y‟ (Chen & Popovich, 2002). What the

correlations reported in the present study demonstrate is that there is a tendency for

students who perform better in second language performance assessments to achieve

higher marks at the end of their Foundation Certificate programme, and that this

tendency is more pronounced for strong performance assessments.

59

7. REFERENCES

Alderson, J. (2000). Testing in EAP: Progress? Achievement? Proficiency? In G. M.

Blue, J. Milton, & J. Saville (Eds.), Assessing English for Academic Purposes

(pp. 21-47). Bern: Peter Lang.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and

Evaluation. Cambridge: Cambridge University Press.

Bachman, L. (1990). Fundamental Considerations in Language Testing. Oxford:

Oxford University Press.

Bachman, L. (2004). Statistical Analyses for Language Assessment. Cambridge:

Cambridge University Press.

Bachman, L. (2007). What is the construct? The dialectic of abilities and contexts in

defining constructs in language assessment. In J. Fox, M. Wesche, D. Bayliss,

L. Cheng, C. Turner, & C. Doe (Eds.), Language Testing Reconsidered (pp.

41-71). Ottawa: University of Ottawa Press.

Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice. Oxford:


Bachman, L., & Kunnan, A. (2005). Statistical Analyses for Language Assessment:

Workbook and CD-ROM. Cambridge: Cambridge University Press.

Banerjee, J., & Wall, D. (2006). Assessing and reporting performances on pre-

sessional EAP courses: Developing a final assessment checklist and

investigating its validity. Journal of English for Academic Purposes, 5, 50-69.

Black, J. (1991). Performance in English Skills Courses and Overall Academic

Achievement. TESL Canada Journal, 9(1), 42-56.

Brindley, G. (2009). Task-centred language assessment in language learning. In K.

Van den Branden, M. Bygate, & J. Norris (Eds.), Task-Based Language

Teaching: A Reader (pp. 435-454). Amsterdam: John Benjamins Publishing

Company.

Brown, J. D. (2004). Performance Assessment: Existing Literature and Directions for

Research. Second Language Studies, 22(2), 91-139.

Brown, J. D., & Rodgers, T. S. (2002). Doing Second Language Research. Oxford:


60

Brown, J., Hudson, T., Norris, J., & Bonk, W. (2002). An Investigation of Second

Language Task-Based Perfomance Assessments. Honolulu: University of

Hawaii Press.

CAEL. (2013). What does the CAEL Assessment measure? Retrieved from

http://www.cael.ca/edu/format.shtml

Caudery, T. (1990). The validity of timed essay tests in the assessment of writing

skills. ELT Journal, 44(2), 122-131.

Chapelle, C. (1999). Validity in language assessment. Annual Review of Applied

Linguistics, 19, 254-272.

Chen, P. Y., & Popovich, P. M. (2002). Correlation: Parametric and Nonparametric

Measures. Sage University Papers Series on Quantitative Applications in the

Social Sciences, 07-139. Thousand Oaks, CA: Sage Publications.

Chen, T., & Ku, N. K. (2008). EFL Students: Factors Contributing to Online

Plagiarism. In T. Roberts (Ed.), Student Plagiarism in an Online World:

Problems and Solutions (pp. 77-91). Hershey, PA: Information Science

Reference.

Cho, Y. (2003). Assessing writing: Are we bound by only one method? Assessing

Writing, 8, 165-191.

Cho, Y., & Bridgeman, B. (2012). Relationship of TOEFL iBT scores to academic

performance: Some evidence from American universities. Language Testing,

29(3), 421-442.

Cohen, L., Manion, L., & Morrison, K. (2000). Research Methods in Education, 5th

Edition. London: RoutledgeFalmer.

Cope, N. (2011). Evaluating Locally-Developed Language Testing: A Predictive

Study of 'Direct Entry' Language Programs at an Australian University.

Australian Review of Applied Linguistics, 34(1), 40-59.

Cotton, F., & Conrow, F. (1995). An Investigation into the predictive validity of

IELTS amongst a group of international students studying at the university of

Tasmania. IELTS Research Reports, 1, Report 4, 72-115.

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999).

Dictionary of Language testing. Cambridge: Cambridge University Press.

Dooey, P., & Oliver, R. (2002). An investigation into the predictive validity of the

IELTS Test as an indicator of future academic success. Prospect, 17(1), 36-54.

Douglas, D. (2000). Assessing Languages for Specific Purposes. Cambridge :

Cambridge University Press.

61

Douglas, D. (2010). Understanding Language Testing. Oxon: Hodder Education.

Dudley-Evans, T., & St John, M. (1998). Developments in English for Specific

Purposes. Cambridge: Cambridge University Press.

EALTA. (2006). EALTA Guidelines for Good Practice in Language Testing and

Assessment. Retrieved from http://www.ealta.eu.org/guidelines

Elder, C., & O'Loughlin, K. (2003). Investigating the relationship between intensive

English language study and band score gain on IELTS. IELTS Research

Reports, 4, report 6.

Errey, L. (2000). Stacking the Decks: What does it Take to Satisfy Academic Readers'

Requirements? In G. M. Blue, J. Milton, & J. Saville (Eds.), Assessing English

for Academic Purposes (pp. 147-167). Bern: Peter Lang.

Feast, V. (2002). The Impact of IELTS Scores on Performance at University.

International Education Journal, 3(4), 70-85.

Gearhart, M., & Herman, J. (1998). Portfolio Assessment: Whose Work Is It? Issues in

the Use of Classroom Assignments for Accountability. Educational

Assessment, 5(1), 41-55.

Gebril, A., & Plakans, L. (2013). Toward a Transparent Construct of Reading-to-Write

Tasks: The Interface Between Discourse Features and Proficiency. Language

Assessment Quarterly, 10, 9-27.

Green, A. (2007). IELTS Washback in Context: Preparation for academic writing in

higher education. Cambridge: Cambridge University Press.

Green, R. (2013). Statistical Analyses for Language Testers. Basingstoke: Palgrave

Macmillan.

Hamp-Lyons, L., & Kroll, B. (1997). TOEFL 2000 - Writing: Composition,

Community, and Assessment. Princeton, NJ: ETS.

Hartill, J. (2000). Assessing Postgraduates in the Real World. In G. M. Blue, J. Milton,

& J. Saville (Eds.), Assessing English for Academic Purposes (pp. 117-130).

Bern: Peter Lang.

Hartnett, R., & Willingham, W. (1980). The Criterion Problem: What Measure of

Success in Graduate Education? Applied Psychological Measurement, 4, 281-

291.

Hayatt, D., & Brooks, G. (2009). Investigating stakeholders' perceptions of IELTS as

an entry requirement for higher education in the UK. IELTS Research Reports,

10, Report 1.

62

Henning, G. (1987). A guide to language testing: development, evaluation, research.

Cambridge, MA: Newbury House Publishers.

Howell, B., Nathan, P., Schmitt, D., Sinclair, C., Spencer, J., & Wrigglesworth, J.

(2012). BALEAP Guidelines on English Language Tests for University Entry.

Retrieved from http://www.baleap.org/projects/testing-working-party

Hughes, A. (2003). Testing for Language Teachers. Cambridge: Cambridge

University Press.

IELTS. (2011). Guide for educational institutions, governments, professional bodies

and commercial organisations. Retrieved from www.ielts.org

IELTS. (2012). Ensuring quality and fairness in international language testing.

Retrieved from http://ielts.org/pdf/IELTSQualityFairnessBrochure_2012.pdf

IELTS. (2013). Band descriptors, reporting and interpretation. Retrieved from

http://www.ielts.org/researchers/score_processing_and_reporting.aspx

Ingram, D., & Bayliss, A. (2007). IELTS as a predictor of academic language

performance. IELTS Research Reports, 7, Report 3.

International Study and Language Institute. (2013). TEEP: Candidate's Handbook.

Retrieved from

http://www.reading.ac.uk/web/FILES/ISLC/TEEP_candidate%27s_handbook.

pdf

Jordan, R. (1997). English for Academic Purposes: A Guide and Resource Book for

Teachers. Cambridge: Cambridge University Press.

Kerstjens, M., & Nery, C. (2000). Predictive Validity in the IELTS Test: A Study of

the Relationship Between IELTS Scores and Students' Subsequent Academic

Performance. IELTS Research Reports, 3, Report 4.

Knoch, U. (2011). Rating scales for diagnostic assessment of writing: What should

they look like and where should the criteria come from? Assessing Writing, 16,

81-96.

Kroll, B. (1990). What does time buy? ESL student performance on home versus class

compositions. In B. Kroll (Ed.), Second Language Writing: Research Insights

for the Classroom (pp. 140-154). Cambridge: Cambridge University Press.

Larson-Hall, J. (2010). A Guide to Doing Statistics in Second Language Research

Using SPSS. New York: Routledge.

Lee, Y. (2005). A Summary of Contsruct Validation of an English for Academic

Purposes Placement Test. Spaan Fellow Working Papers in Second or Foreign

Language Assessment, Volume 3, 113-131.

63

Leki, I. (2011). Learning to write in a second language: Multilingual graduates and

undergraduates expanding genre repertories. In R. M. Manchon (Ed.),

Learning-to-Write and Writing-to-learn in an Additional Language (pp. 85-

109). Amsterdam: John Benjamins Publishing Company.

Lewkowicz, J. A. (1997). The integrated testing of a second language. In C. Clapham,

& D. Corson (Eds.), Encyclopaedia of language and education, vol. 7:

Language testing and assessment (pp. 121-130). Dortrecht, The Netherlands:

Kluwer.

Long, M., & Norris, J. (2009). Task-based teaching and assessment. In K. Van den

Branden, M. Bygate, & J. Norris (Eds.), Task-Based Language Teaching: A

Reader (pp. 135-142). Amsterdam: John Benjamins Publishing Company.

McNamara, T. (1996). Measuring Second Language Performance. Harlow: Longman.

McNamara, T. (1997). Performance Testing . In C. Clapham, & D. Corson (Eds.),

Encyclopedia of Language and Education, Volume 7: Language Testing and

Assessment (pp. 131-139). Dordrecht: Kluwer Academic Publishers.

Messick, S. (1980). Test Validity and the Ethics of Assessment. American

Psychologist, 35(11), 1012-1027.

Messick, S. (1995). Standards of validity and the validity of standards in performance

assessment. Educational Measurement: Issues and Practice, 14(4), 5-8.

Moore, T., & Morton, J. (2007). Authenticity in the IELTS Academic Module Writing

test: a comparative study of Task 2 items and university assignments. In L.

Taylor, & P. Falvey (Eds.), IELTS Collected Papers: Research in speaking and

writing assessment (pp. 197-248). Cambridge: Cambridge University Press.

Nesi, H. (2002). A study of dictionary use by international students at a British

university. International Journal of Lexicography, 15(4), 277-305.

Niño, A. (2009). Machine translation in foreign language learning: language learners'

and tutors' perceptions of its advantages and disadvantages. ReCALL, 21(2),

241-258.

Norris, J. (2002). Interpretations, intended uses and designs in task-based language

assessment. Language Testing, 19(4), 337-346.

Oliver, R., Vanderford, S., & Grote, E. (2012). Evidence of English language

proficiency and academic achievement of non-English-speaking background

students. Higher Education Research and Development, 31(4), 541-555.

O'Loughlin, K. (2011). The Interpretation and Use of Proficiency Test Scores in

University Selection: How Valid and Ethical Are They? Language Assessment

Quarterly, 8(2), 146-160.

64

O'Loughlin, K., & Arkoudis, S. (2009). Investigating IELTS exit score gains in higher

education. IELTS Research Report, Volume 10.

Pallant, J. (2007). SPSS Survival Manual. Maidenhead: Open University Press.

Plakans, L. (2008). Comparing composing processes in writing-only and reading-to-

write test tasks. Assessing Writing, 13, 111-129.

Rees, J. (1999). Counting the Cost of International Assessment: why universities may

need to get a second opinion. Assessment and Evaluation in Higher Education,

24(4), 427-438.

Reinders, H., Moore, N., & Lewis, M. (2008). The International Student Handbook.

Basingstoke: Palgrave Macmillan.

Robinson, P., & Ross, S. (1996). The Development of Task-Based Assessment in

English for Academic Purposes Programs. Applied Linguistics, 17(4), 455-476.

Routes into university and higher education. (2013). Retrieved from

http://www.nidirect.gov.uk/routes-into-university-and-higher-education

Salmons, J. (2008). Expect Originality! Using Taxonomies to Structure Assignments

that Support Original Work. In T. Roberts (Ed.), Student Plagiarism in an

Online World (pp. 208-226). Hershey, PA: Information Science Reference.

Salvia, J., & Ysseldyke, J. (2001). Assessment. Boston, MA: Houghton Mifflin .

Schmitt, D. (2012). EAP Assessment in the UK. Retrieved from

http://www.baleap.org/projects/testing-working-party

Shohamy, E. (2001). Democratic assessment as an alternative. Language Testing,

18(4), 373-391.

Skehan, P. (1998). A Cognitive Approach to Language Learning. Oxford: Oxford

University Press.

Slaght, J., & Howell, B. (2007). TEEP: A Course-driven Assessment Measure. In O.

Alexander (Ed.), New Approaches to Materials Development for Language

Learning: Proceedings of the 2005 joint BALEAP/SATEFL conference (pp.

253-264). Bern: Peter Lang AG.

Taylor, L., & Geranpayeh, A. (2011). Assessing listening for academic purposes:

Defining and operationalising the test construct. Journal of English for

Academic Purposes, 89-101.

The TOEFL iBT Test. (2013). Retrieved from TOEFL: http://www.ets.org/toefl

65

UK Border Agency. (2013). English language ability. Retrieved from

http://www.ukba.homeoffice.gov.uk/visas-immigration/studying/adult-

students/can-you-apply/english-language/

UK Council for International Student Affairs. (2013). International student statistics:

UK higher education. Retrieved from http://www.ukcisa.org.uk

Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press.

Weigle, S. C. (2004). Integrated reading and writing in a competency test for non-

native speakers of English. Assessing Writing, 9, 27-55.

Weir, C. (2005). Language Testing and Validation. Basingstoke: Palgrave Macmillan.

Wigglesworth, G. (2008). Task and performance based assessment. In E. Shohamy, &

N. Hornberger (Eds.), Encylopedia of Language and Education, 2nd Edition,

Volume 7: Language Testing and Assessment (pp. 111-122). Boston, MA:

Springer Science+Business Media LLC.

Woodrow, L. (2006). Academic Success of International Postgraduate Education

Students and the Role of English Proficiency. University of Sydney Papers in

TESOL, 1, 51-70.

Yen, D., & Kuzma, J. (2009). Higher IELTS Score, Higher Academic Performance?

The Validity of IELTS in Predicting the Academic Performance of Chinese

Students. Worcester Journal of Learning and Teaching (3), 1-7.

66

8. APPENDICES

Appendix I: Sample LS1 Reading & Writing Exam

Language for Study Level 1

End of Term Writing Examination Question Book

Time: 60 minutes

Instructions to students:

1. Read all instructions very carefully.

2. At the start of the examination, you should have: i. this question book ii. an answer book.

3. Read all of the exam questions carefully before you begin to write.

4. Write your answer(s) in the answer book.

5. Write in black or blue ink.

6. Cross out any mistakes.

7. You cannot use a dictionary.

8. If you have questions at any time during the examination, you must raise your hand and wait for an invigilator. Do not attempt to communicate, by any means, with any other candidate at any time before or during the examination.

9. You may not leave the examination hall during the first hour. If the examination is less than one hour, you must stay for the entire duration of the exam (toilet breaks are permitted if necessary).

10. At the end of the exam, do not speak or leave your seat until you have handed both the question book and the answer book and any additional sheets of paper you may have used to an invigilator, and you have been told by an invigilator that you may leave the room.

67

TASK:

You are a student at a UK university. In one of your core modules you have recently been

discussing issues related to Education and technology. Read the notes you have made below

(page 2) and the related quotations (page 3). Then using the information appropriately,

write a discussion essay for your tutor in response to the following statement:

“In the 21st century, distance learning is a more attractive

option than attending a university in person.”

In your essay you must discuss the statement above, presenting arguments for and against

it. You should support your ideas by making reference to the reading extracts provided on

page 3. You should not copy sentences directly from the extracts but paraphrase the

information using your own words. You should provide in-text citations every time you use

the provided sources. You should discuss both positive and negative aspects of the essay

topic and use ideas from at least 2 of the extracts provided. You may also use your own

ideas.

You have 60 minutes to complete this task. You should write a maximum of 250 words.

NOTES:

Issues about Distance Learning

Positive Aspects

- More choice and flexibility

Negative Aspects

- Lack of self-discipline may lead to failure

68

Students have the opportunity to choose from various schools, programs and courses which are not available in the area where they live. This is especially beneficial for those who live in rural areas that only have one or two educational facilities, which may offer limited course and program options for students. Furthermore, the flexibility offered by online learning benefits not only undergraduate students, but also individuals who already have full-time jobs or family commitments.

Doe, J. (2012). Online Education. Retrieved from http://seacstudentweb.org/top-benefits-of-

online-education.php

The flexibility advantage of distance learning does not reduce the most

significant demand on a student’s time because distance learning still requires

students to log on, study, do homework and write exams. The drop-out rate for

online students is higher partly due to a common misunderstanding about the

amount of time required of a distance student. Moreover, the traditional

university environment in which inquiring students learn through interaction

with other inquiring students is replaced by an uninspiring electronic

environment experienced in the isolation of the home.

Simmons, J. R. (2001). Distance Learning: Education or Economics? International Journal of

Value-Based Management, 14,157–169

Online technologies are attractive because they provide the opportunity to

create rich learning environments consisting of multimedia resources and

facilities for communication and interaction. What online technologies do,

in addition to common access to learning resources, is promote student–

teacher and student–student interaction whatever the participants‟

location.

Calvert, J. (2005). Distance Education at the Crossroads. Distance Education,26(2), 227–238

***This is the end of the exam paper***

69

Appendix II: Sample LS1 Listening & Speaking Exam

LANGUAGE FOR STUDY 1 END OF TERM SPEAKING EXAMINATION (SAMPLE SET)

EXAMINER’S VERSION

Task Instructions:

This exam task consists of two parts:

In the first part of the task the candidate listens to two speakers and answers questions about what they have heard (2 questions for each speaker). (2- 3 minutes)

In the second part of the task, the examiner will ask the candidate up to 3 more questions about the topic of healthy eating. The candidate needs to answer all questions they are asked with a brief response. This part of the exam should last 2 – 3 minutes.

Total time: approx 10 minutes including turnaround time and completing mark sheets.

Part 1.

Give the candidate their tasks, some paper to make notes on and ask them to read the instructions. Explain what will happen in the exam and give them 30 seconds to go through the questions for the listening. Before you play the recordings, ask the candidate if they understand what they need to do, explain again if necessary. Point to the paper provided and encourage the candidate to make notes whilst they listen.

Questions for the extracts:

Speaker 1:

1. Why do supermarkets sell fat-free foods?

2. Why are fat-free foods not always healthier than full-fat products?

Speaker 2:

1. Why do most people nowadays choose fat-free products?

2. What trend in people’s weight has been observed?

70

Ask the candidate if they are ready to listen. Then play the recordings of both speakers once only. Do not give the candidate extra time between the speakers.

Audio scripts:

Speaker One:

(male/female lecturer’s voice): Have you noticed how supermarket shelves have gradually been stacked with more and more ‚fat free‛ or ‚low-fat‛ food? Almost every food product comes in two versions nowadays: every regular, that is, full-fat product also has a low-fat or fat-free equivalent. You might ask why? Well, the answer is simple, the food industry has responded to their customers’ demand. These customers seem to believe that eating ‘low-fat’ or ‘fat-free’ products is healthier than eating regular, full-fat foods. The problem is, however, that this is not exactly true. Let me explain why. You see, to claim that a product is 'low fat’, the amount of fat in this product must be at least 25% lower than in the standard product. This fat is replaced with other ingredients, which are not necessarily healthier than fat! For example, fat is replaced with sugar or, worse still, with some chemicals which improve the taste of the product. And so, in result, low-fat products have got twice as many ingredients as the regular versions and can be much worse for our diet!

Speaker Two: (male/female lecturer’s voice): Recent research shows that most people nowadays opt for low-fat or fat-free products. The cause of this trend is relatively easy to understand. People believe that if they buy fat-free foods then they should not gain weight or should be able to lose weight because they are eating less fat. Although it seems logical, it is, unfortunately, not true at all and here's why. The fact is that there is no connection between eating fat and the fat produced by your body. The body can take any type of food and turn it into body fat. For example, if you eat 2 lbs. of sugar every day then you’ll probably gain some fat tissue after a while – yet sugar has zero fat in it! So, eating 2 lbs of sugar every day is a ‚fat free‛ diet, but yet you’d still gain fat. It’s simple, eating fat-free products doesn’t mean you will not put on weight! This brings me to my second point, namely: what does this all tell you about fat-free products? You see, what happens is that most people hear the term ‘fat-free’ and see it as a green light to eat as much of it as they want, thinking they will not put on extra weight. And that’s why despite the fact that we have all these low-fat and fat-free foods, we are fatter than at any time in history and the trend is only continuing to go higher!

Stop the recording. Ask the candidate the listening questions. Elicit an answer after each question.

71

PART 2. 1. Choose three questions from the list below. Ask ONE question at a time. You can

repeat the question if the candidate asks you to. This part of the exam should last 2 – 3 minutes.

1. What kind of food is popular in your country? 2. How important is it to eat healthily? 3. What is your typical diet? 4. How important is it to teach children about healthy food and

who’s responsibility is it (school, parents)? 5. Where can people find information about and get advice on

healthy eating in your country? 6. How different is British food to food from your country?

2. After the candidate has answered the last question (or after the 3 minutes have passed), thank the candidate and inform them this is the end of the exam and they can leave the examination room. Collect all the notes and exam materials from the candidate before they leave the room.

5. Fill in the candidate’s mark sheet using the LS Speaking Descriptors.

72

Appendix III: Sample SS1 Essay Task

SS1 (Foundation Level- 2 Term)

Essay (60%)

(Autumn 2012)

This assessment is an essay writing task. The following gives you more information about the task.

Question:

Outline the factors which could contribute to low health expectancy in

developed countries. Discuss possible solutions to reduce this problem.

In your essay, you should refer to a number of sources and can include your own ideas on this topic (you can use the sources provided in Unit 4 of the SS1 textbook, but must also include a minimum of two other relevant sources as well).

You must write in paragraphs and make sure you provide in-text citations for all the sources you refer to in your essay and a list of references at the end of your essay.

The essay should be 750 words.

What you will be assessed on:

The relevance of your ideas to the essay task. The structure of your essay (how well you planned your essay and how well

you link your ideas within and between paragraphs). How well you support your ideas using sources and how accurate your in-text

referencing and final references are. How appropriate and accurate your use of English language is, especially

register and style.

73

Submission checklist:

Checked your essay several times for accuracy Typed and printed essay Used Times New Roman or Arial font type, size 12 Double spaces between lines. Printed single-sided only.

Included a title page with the following information:

i. Module Code (e.g. FC501 2T)

ii. Class/Group: (e.g. Group A)

iii. Module Title (Skills for Study 1)

iv. Assessment Title (e.g. Essay, Presentation etc.)

v. Assignment Title: (e.g. Discuss.......)

vi. Tutor Name: (name of tutor)

vii. Student ID Number: (please add your ID number only and

NOT your name)

viii. Date of Submission: (date)

ix. Word count (the number of words YOU used)

Used a header on each page of your assignment with your ID number and module code (FC501 2T).

Numbered all pages. Stapled all pages together Printed and submit two copies of your assignment. Filled in the official submission form, clearly stating: essay title, your name and

ID number, your tutor’s name. Kept the receipt for your submission. Sent a copy of your assignment to the Turnitin class set up by your tutor.

Failure to follow these guidelines could result in you losing 5 marks.

Submission Deadline:

74

Appendix IV: Sample SS1 Presentation Task

SS1 (Foundation Certificate Level)

2 Term

Oral Presentation (40%)

(Autumn 2012)

You will work with one other student to prepare a presentation using the following title.

Outline the factors which could contribute to low health expectancy in

developed countries. Discuss possible solutions to reduce this problem.

You need to:

1- Discuss what you are trying to persuade your audience of. 2- Work with your partner to create a PowerPoint presentation. 3- Deliver the presentation with your partner.

What you will be assessed on:

The way you deliver your presentation (including how well you cooperate with your partner and how accurate and appropriate is your use of English language)

The content of your presentation and the design of the PowerPoint slides to support what you say.

How well you organize your ideas and link/signpost them.

How well you present your thesis and support it using data or examples in appropriate detail.

How well you deal with questions from the audience.

Further Guidelines:

Presentations must be between 6 and 8 minutes. Up to 3 additional minutes will be allowed for answering questions from the audience.

Any sources you use to support points should be acknowledged.

Submission (Presentation) Date:

75

Appendix V: Box-and-whisker plots of frequency distribution output: IELTS

Overall Score and Final Academic Score

Figure 1. Box-and-whisker plot of frequency distribution output: IELTS Overall Score

Figure 2. Box-and-whisker plot of frequency distribution output: Final Academic

Score

76

Appendix VI: Histograms of IELTS and Final Academic Score distributions

Figure 1. Histogram of IELTS score distribution

Figure 2. Histogram of Final Academic score distribution

77

Appendix VII: Normal and Detrended Q-Q plots of IELTS Overall and Final

Academic Score

Figure 1. Normal Q-Q plot of IELTS Overall score

Figure 2. Detrended Q-Q plots of IELTS Overall score

78

Figure 3. Normal Q-Q plot of Final Academic Score

Figure 4. Detrended Q-Q plots of Final Academic Score

79

Appendix VIII: Box-and-whisker plots of frequency distribution output: LS1

Reported Overall Score and Final Academic Score

Figure 1. Box-and-whisker plot of frequency distribution output: LS1 Reported

Overall Score

Figure 2. Box-and-whisker plot of frequency distribution output: Final Academic

Score

80

Appendix IX: Normal and Detrended Q-Q plots of LS1 Reported, LS1 R&W,

LS1 S&L, and Final Academic Score

Figure 1. Normal Q-Q plot of LS1 Reported

Figure 2. Detrended Q-Q plot of LS1 Reported

81

Figure 3. Normal Q-Q plot of LS1 R&W

Figure 4. Detrended Q-Q plot of LS1 R&W

82

Figure 5. Normal Q-Q plot of LS1 L&S

Figure 6. Detrended Q-Q plot of LS1 L&S

83

Figure 7. Normal Q-Q plot of Final Academic Score

Figure 8. Detrended Q-Q plot of Final Academic Score

84

Appendix X: Box-and-whisker plots of frequency distribution output: SS1

Overall Score and Final minus SS1 (F-SS1)

Figure 1. Box-and-whisker plot of frequency distribution output: SS1 Overall Score

Figure 2. Box-and-whisker plot of frequency distribution output: Final minus SS1 (F-

SS1)

85

Appendix XI: Normal and Detrended Q-Q plots of SS1 and Final minus SS1

Figure 1. Normal Q-Q plot of SS1

Figure 2. Detrended Q-Q plot of SS1

86

Figure 3. Normal Q-Q plot of Final minus SS1

Figure 4. Detrended Q-Q plot of Final minus SS1

nikolay gochev

Documents