CHAPTER 4 RESEARCH DESIGN

     In the previous chapters, ways of approaching how reading ability could be defined from the perspective of test item specifications were explored. In Chapter 2, it was examined and emphasized that, in investigating the nature of a reading test in relation to the latent structure of reading ability, the scope of the present study is on the "product" of FL reading as a result of the FL reading "process". Furthermore, Chapter 3 described a way in which a construct of reading ability could be defined by developing test items that elicit certain types of reading product in test takers' reading comprehension. Reading "competence" was termed a facet that constitutes a major part of reading "performance", and in defining the reading construct for the purpose of reading test item development, it was proposed that, although a test item is defined as a tool which elicits a reading performance, that performance should be accepted as something that allows testers to draw inferences and make generalizations about what sort of reading activities the test taker might be able to do. Furthermore, this should be considered analytically as an interaction of the test taker's competence and the context rather than as something holistic and content-representative. To continue along the same lines of approach, the significance of specifying the components of a test item, "question types" in particular, in operationalizing the reading construct to be tested was discussed. This was further explored by reflecting on item difficulty, a quantitative aspect of a test item. The discussion concluded by suggesting a possible link between the question type of a test item and its difficulty, which gives rise to the following research questions for the present study.

4.1 Research questions

Research Question 1:

Is it valid to employ 'question types' as a prime component that constructs test items used in eliciting test takers' L2 reading performances?


     What are the factors that constitute the L2 reading performances of learners of English in secondary education in Japan, when they are extracted from factor analytic studies of reading products elicited using reading test items? Would they differ across learners with different reading abilities?

     In an attempt to come up with a test item specification that effectively operationalizes different reading performances to be tested, and inspired by Negishi (1996) and Wada (2003), the present study proposes the 'question type' of a test item as a prime component of such a framework. At the same time, however, because Negishi (1996) and Wada (2003) did not accommodate the interactions of these constructing components with the latent reading structure of test takers, attention will be given to this aspect in much greater depth, as it is possible that the prime factors could change in accordance with the test takers' reading abilities.

Research Question 2:

Is it valid to assume a certain relationship between question types and item difficulty in eliciting test takers' L2 reading performances?

     Is the item difficulty of a test item, calibrated using Item Response Theory, affected by its question type? If so, how? Would this relationship differ across learners with different reading abilities?

     With an interest in suggesting the facets of a reading test item that would allow the writers of test items to predetermine the difficulty of a test item, the present study investigates the possibility of a link between the item difficulty of a test item and its question type. Attention will also be given to test takers of different abilities, to see whether the order of perceived difficulties across different question types differs according to the ability group of the test takers.


4.2 Data Collection

4.2.1 Subjects

     A sample of 830 learners of English from senior high school and university in Japan participated in the main part of the present study. Of these, 280 were third-year high school students and 550 were first-year undergraduate students at university.

     The majority of the high school students had had five years of English education in a foreign language environment under the Course of Study provided by the Ministry of Education, Culture, Sports, Science and Technology. They were told that the test was administered to collect data on each individual's English proficiency. The students had five English classes a week; nothing was done in the classroom that would help the students prepare for the tests administered in this study.

     For the university students, the circumstances were the same as for the high school students except that the duration of English learning was mostly six years. All of the university students majored in one foreign language other than English and were given the test early in April, immediately after they had entered university, as a placement test for their English classes, which were required in the university curriculum. This was to ensure that the test takers did not have any special knowledge of English or of any other academic field that would distort the outcome of the data collection.

     There were some variations in both the high school and university students' backgrounds in how, and for how long, English had been learned (e.g. students who had overseas experience). However, the variation in the number of years they had spent abroad or in the intensity of their English learning was so great that it was not possible to come up with any generalizable criterion for omitting scores. Moreover, it could be assumed that those variations would be an inherent factor in learners' reading ability that enables them to score high on the test, so the present author decided to disregard such factors in the process of data collection as long as they did not affect the distribution of scores too greatly.


4.2.2 Materials

     Two sets of test instruments were employed in the main study.

4.2.2.1 Test Set A

     Test Set A (presented in Appendix A) consists of nine passages, each passage accompanied by three multiple-choice test items (one correct option and three distracters) to be answered on the basis of comprehension of the passage. These nine passages were selected after an item selection was done in the pilot study, providing 27 reading test items. The features of these nine passages are as follows:

Table 4-1  The features of passages employed in Test Set A

TEXT   Item #    R. Ease   Gr. Level   Words #
1      1-3       56.8      8.7         95
2      4-6       66.4      7.3         109
3      7-9       55.2      10          108
5      13-15     65.2      9.6         110
6      16-18     64.8      7.5         95
7      19-21     65.7      7.6         101
8      22-24     68.1      7.4         104
9      25-27     55.8      8.4         97
10     28-30     57.1      9.5         103
Mean             61.68     8.44        102.44

(Text 4, as well as Items 10, 11, and 12, is missing from the table because they were omitted after the item selection.)

     All of the passages were taken from the Reading Comprehension Section (advanced level) of the Global Test of English Communication (GTEC) developed by Benesse Corporation. The present author determined GTEC to be an appropriate source of reading texts since it was designed to test the English proficiency of high-intermediate learners in senior high schools and universities in Japan, which is equivalent to the level of the subjects to be tested and also to what the Course of Study provided by the Ministry of Education, Culture, Sports, Science and Technology aims for.


     In Table 4-1, "R. Ease" indicates the Flesch Reading Ease and "Gr. Level" indicates the Flesch-Kincaid Grade Level. Both are readability indices, that is, measures of how easily written materials can be read and understood. Although they employ the same core measures (word length and sentence length) to calculate the index, they have different weighting factors, which sometimes creates incoherence between the outcomes of the calculations. The Flesch Reading Ease indicates the ease of reading a passage on a scale of zero to one hundred, zero being the most difficult and one hundred the easiest. The Flesch-Kincaid Grade Level expresses readability as a grade level of the US educational system, making it easier to judge the readability level of various books and texts. Observing these indices for the nine passages used in Test Set A, the present author assumes that the difficulty of the passages was appropriate for the subjects and for the purpose of the present research (see 4.3.1 for further explanation of how the subject groups were predetermined for the main study).
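     Both indices are computed from the same two ratios, words per sentence and syllables per word, weighted differently. The following minimal sketch in Python shows the two standard formulas, assuming the sentence, word, and syllable counts of a passage have already been obtained (the example counts are hypothetical, not taken from the passages in Table 4-1):

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text (roughly 0 to 100)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: readability expressed as a US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a passage of about 100 words
print(flesch_reading_ease(words=100, sentences=6, syllables=150))   # about 63.0
print(flesch_kincaid_grade(words=100, sentences=6, syllables=150))  # about 8.6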

     The number of words in each passage was counted so as to regulate the characteristics of each passage. The present author selected passages of around 100 words, considering the time constraints of the testing environment. The numbers at the bottom of the table indicate the means for each index.

     As for the three multiple-choice test items that were to be answered after reading each passage, the present author wrote the questions and the four options. Which question type (see 3.4.2 for detailed explanations) each item represented was checked by her colleagues (two teachers at a senior high school), and their assessments showed a sufficient correlation of .76. Items on which there was disagreement were discussed and revised until all three people (the two colleagues and the present author) were satisfied with the decision.

     For each passage, the first item was written so that the question elicits a "global-inferential" comprehension of the passage. These were the items numbered 1, 4, 7, 13, 16, 19, 22, 25, and 28, and they asked for the main idea of the passage. For example, item 1 of Test Set A ("1. What is the main idea of this passage?") can be answered correctly if a test taker comprehends that the main idea of the passage is the growing seam in the seafloor of the Atlantic Ocean. The wording and phrases used in each question may vary, but all nine questions (items 1, 4, 7, 13, 16, 19, 22, 25, and 28) are made to elicit the "global-inferential" type of reading.

     The second item was written so that the question asks for a "local-literal" comprehension. These were the items numbered 2, 5, 8, 14, 17, 20, 23, 26, and 29, and they asked for information that is directly interpreted from a relatively small amount of text source. With regard to the first passage in Test Set A, item 2 is such an item. Item 2 requires a test taker to complete the sentence, "Q. The speed at which the seafloor is spreading is ___." The correct option, "(C) half as fast as human fingernails grow," can be chosen if the test taker can spot and understand the last sentence in the passage, "This spreading occurs in half of a speed of how fast fingernails grow," as it is, without any further inferring from the text.

     The last item was composed so that the question provokes a "local-inferential" understanding of the passage. These were items 3, 6, 9, 15, 18, 21, 24, 27, and 30, and they called for information that could be obtained after making an inference from a relatively small amount of text source. With regard to the first passage in Test Set A, item 3 ("3. The break-off of Pangaea started because...") requires this type of comprehension and asks for the cause of the growing seam in the seafloor of the Atlantic Ocean. In order to choose the correct option, "(B) a plate started to develop underwater and the land was separated," a test taker needs to understand the sentence, "Since that time, the Atlantic Ocean has widened along a hot, rock-producing seam in the seafloor," and infer that the 'rock-producing seam' is the cause of the break-off of Pangaea.

     The three questions for each passage were ordered so that the global-inferential question would come first, the local-literal question second, and the local-inferential question third. The present author chose to present them in this order because this is the order in which such questions seem to appear in the reading sections of common standardized proficiency tests, such as TOEFL or TOEIC.

     As for the time allocated to this test, because one class period in senior high schools is usually 50 minutes, 50 minutes was the maximum length of time allowed to implement Test Set A. Ideally, sufficient time should be given to the test takers, since the focus of the present study is on the test takers' 'power' rather than their 'speed'. Therefore, special attention was given so that the test takers would be able to complete the test set within the time allocated.

     Prior to the test implementation for the main study, a pilot test was carried out in order to validate the test items developed by the procedures described above. The subjects were 143 students from a senior high school considered to be of an academic level equivalent to that of the high school at which Test Set A was implemented in the main study.

     The main interest in carrying out the pilot test was to find and edit the test items that exhibited problems with their item discrimination indices. Item discrimination is "the capacity of test items to differentiate among candidates possessing more or less of the trait that the test is designed to measure" (Davies et al. 1999: 96). In developing a test instrument, it is essential that the test items have high levels of item discriminability to ensure a reliable measurement of test takers' ability. Items with a low item discrimination index are usually eliminated from a test or edited. In the present study, item discriminability was calculated using classical test theory (point-biserial correlation calculated by ITEMAN) due to the small number of subjects and items.

     In Table 4-2, "PBs" indicates the point-biserial correlation, and "PC" indicates the percentage of test takers who correctly answered each item. The point-biserial correlation indicates how well an item discriminates test takers who are more capable from those who are less capable. Point-biserial correlations of .25 and above are often considered acceptable (Henning 1987: 53), and most of the items surpassed this criterion. Percentage correct is used to show how easy (or difficult) a test item is, because the higher (lower) the percentage of test takers who correctly answered a test item, the easier (more difficult) the item was perceived to be by the test takers.
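     For reference, the point-biserial correlation of an item is the correlation between the dichotomous item score (0 or 1) and the total test score; ITEMAN can also report a corrected variant that excludes the item from the total, which is not shown here. A minimal sketch in Python, using a small hypothetical response matrix rather than the study's data:

import numpy as np

def point_biserial(item: np.ndarray, total: np.ndarray) -> float:
    """Point-biserial correlation between 0/1 item scores and total test scores."""
    p = item.mean()                      # proportion answering correctly (PC)
    q = 1.0 - p
    m1 = total[item == 1].mean()         # mean total score of correct responders
    m0 = total[item == 0].mean()         # mean total score of incorrect responders
    return (m1 - m0) / total.std() * np.sqrt(p * q)

# Hypothetical 0/1 response matrix: rows = test takers, columns = items
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
])
totals = responses.sum(axis=1)
for j in range(responses.shape[1]):
    print(f"Item {j + 1}: PBs = {point_biserial(responses[:, j], totals):.2f}, "
          f"PC = {responses[:, j].mean():.2f}")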

     As is apparent, items 10, 11, and 12 were considered problematic because they showed negative or very low discrimination. These were the items provided for the same passage, so it could be presumed that the passage itself was problematic for this level of test takers. For this reason, the present author decided it best to eliminate all three items along with the passage. Items 1, 2, 3, 9, and 16 also had low discriminability, so the present author reviewed and revised each of these items. Test Set A as presented in Appendix A is the final version of these items after the revision. (The item numbers were left as they were when the test set was implemented in the main study, and this was announced orally to the test takers by the proctors.)

Table 4-2  The discrimination indices of test items in the pilot version of Test Set A

ITEM #   PBs     PC
1        0.05    0.46
2        0.13    0.43
3        0.21    0.33
4        0.51    0.80
5        0.49    0.70
6        0.39    0.40
7        0.35    0.76
8        0.42    0.64
9        0.01    0.13
10      -0.02    0.19
11      -0.10    0.12
12       0.18    0.38
13       0.41    0.47
14       0.56    0.36
15       0.51    0.69
16       0.13    0.40
17       0.49    0.43
18       0.49    0.48
19       0.42    0.54
20       0.42    0.57
21       0.39    0.28
22       0.51    0.62
23       0.58    0.45
24       0.44    0.53
25       0.51    0.62
26       0.54    0.43
27       0.38    0.68
28       0.47    0.42
29       0.56    0.47
30       0.51    0.42

     In order to compare the reading abilities of the test takers who took this test set and, more importantly, to observe the alteration of the latent ability structure among test takers with different reading abilities, items 1, 2, and 3 reappear in Test Set B as items 1, 2, and 3; items 7, 8, and 9 as items 4, 5, and 6; and items 10, 11, and 12 as items 7, 8, and 9. However, as stated in the previous paragraph, because items 10, 11, and 12 were omitted from Test Set A, items 7, 8, and 9 had to be omitted from Test Set B as well.

     As for the time allocated for the completion of the test, the teachers who proctored the pilot study reported that most of the test takers appeared to have reached the last item of the test, which indicates that 50 minutes was sufficient time for the test takers in the present study.

4.2.2.2 Test Set B

     Test Set B is presented in Appendix B. In total, there are 27 test items in the test set; nine passages are provided, each with three multiple-choice test items to test the test takers' comprehension. Each item has one correct option and three distracters. These nine passages were selected after an item selection was done in the pilot study. The features of these nine passages are presented in Table 4-3.

Table 4-3  The features of passages employed in Test Set B

TEXT   Item #    R. Ease   Gr. Level   Words #
1      1-3       56.8      8.7         95
2      4-6       55.2      10          108
4      10-12     34.1      12          157
5      13-15     35.3      12          142
6      16-18     34.8      12          160
7      19-21     37.7      12          160
8      22-24     38.9      12          152
9      25-27     33.4      12          155
10     28-30     33.6      12          151
Mean             40        11.4        142.22

(Text 3, as well as Items 7, 8, and 9, is missing from the table because they were omitted after the item selection.)

     Text 1 is the same passage as Text 1 in Test Set A, Text 2 is the same passage as Text 3 in Test Set A, and Text 3 is the same passage as Text 4 in Test Set A. This was done to compare the reading abilities of the test takers who took this test set, Test Set B, with those of the test takers who took Test Set A and, in particular, to see if any alteration would emerge with regard to the test takers' latent ability structure among different ability groups. The rest of the passages were taken from the Reading Comprehension Section of the TOEFL Test Preparation Kit Workbook (ETS 1998). The present author determined the TOEFL test preparation material to be an appropriate source of reading passages because TOEFL was designed to test the English proficiency of students who are seeking to study at an undergraduate or graduate level in an English-speaking environment, so the level of English proficiency required to complete the passages successfully would be the same as that of advanced learners in Japan, which is equivalent to the level of the subjects to be tested by Test Set B.

     In Table 4-3, "R. Ease" indicates the Flesch Reading Ease and "Gr. Level" indicates the Flesch-Kincaid Grade Level. The number of words was counted so as to regulate the characteristics of each passage. The present author selected passages of around 150 words for Texts 4 to 10, considering the time constraints of the testing environment. The numbers at the bottom indicate the means for each index.

     As for the three multiple-choice test items that were to be answered after reading each passage, the present author wrote the questions and the four options. These questions and options were examined for their validity by her two colleagues. After each passage, a "global-inferential" question, a "local-literal" question, and a "local-inferential" question (see 3.4.2 for detailed explanations of 'question types') are presented in the same manner as in Test Set A. This means that, for each passage, a "global-inferential" question is the first item that comes after the passage, a "local-literal" question the second, and a "local-inferential" question the last. Therefore, items numbered 1, 4, 10, 13, 16, 19, 22, 25, and 28 are "global-inferential" questions, which asked for the main idea of the passage; items numbered 2, 5, 11, 14, 17, 20, 23, 26, and 29 are "local-literal" questions, which asked for information that is directly interpreted from a relatively small amount of text source; and items 3, 6, 12, 15, 18, 21, 24, 27, and 30 are "local-inferential" questions, which asked for information that could be obtained after making an inference from a relatively small amount of text source (see pp. 51-52 for a detailed explanation and examples of how these questions were presented). The assignment of each item to a question type was confirmed by the two colleagues who had worked on the question types of Test Set A, and their correlation was .71. Items on which there was disagreement were discussed and revised until all three people (the two colleagues and the present author) were satisfied with the decision.

     For Test Set B, the time allocated to the test was 50 minutes in order to parallel Test Set A. In writing and revising Test Set B, special attention was also given so that the test takers would be able to complete the test set within the time allocated.

     Prior to the test implementation for the main study, a pilot test was carried out in order to validate the test items developed by the procedures described above. The subjects were 156 students from the same university at which Test Set B was implemented in the main study. They were of the same academic background as the subjects who participated in the main study.

     The main interest in carrying out the pilot test was to find and edit the test items that exhibited problems with their item discrimination indices. As was done in the pilot study for Test Set A, item discriminability was calculated using classical test theory (point-biserial correlation calculated by ITEMAN) due to the small number of subjects.

     In Table 4-4, "PBs" indicates the point-biserial correlation for item discriminability, and "PC" indicates the percentage of test takers who correctly answered each item, to show item difficulty. Items 7, 8, and 9 were automatically eliminated because they were the same items as those eliminated from Test Set A (items 10, 11, and 12). The present author had originally intended to use these three items for level comparison across different subject groups but decided to discard them for this reason and also due to the time constraint expected in the testing environment. Furthermore, items 1 and 2, which show low item discrimination in Table 4-4, were revised because they were the items presented as items 1 and 2 in Test Set A and had also shown low item discrimination in the pilot test for Test Set A. The same was true for items 4 and 6, which were numbered 7 and 9 in Test Set A. Items 23 and 29 also had low discriminability, so they were reviewed and revised accordingly. Test Set B as presented in Appendix B is the final version after these revisions. (The item numbers were left as they were when the test set was implemented in the main study, and this was announced orally to the test takers by the proctors.)

Table 4-4  The discrimination indices of test items in the pilot version of Test Set B

ITEM #   PBs     PC
1        0.27    0.94
2        0.18    0.99
3        0.30    0.86
4        0.29    0.43
5        0.43    0.81
6        0.20    0.44
7        0.38    0.80
8        0.56    0.63
9        0.18    0.67
10       0.45    0.71
11       0.39    0.84
12       0.49    0.57
13       0.42    0.84
14       0.54    0.31
15       0.41    0.36
16       0.30    0.36
17       0.33    0.63
18       0.32    0.21
19       0.20    0.97
20       0.28    0.91
21       0.33    0.81
22       0.23    0.65
23       0.15    0.36
24       0.42    0.51
25       0.47    0.52
26       0.27    0.40
27       0.36    0.22
28       0.29    0.91
29       0.18    0.84
30       0.33    0.75

     As for the time allocated for the completion of the test, the teachers who proctored the pilot study reported that most of the test takers appeared to have reached the last item of the test, which indicates that 50 minutes was sufficient time for the test takers in the present study.

4.2.3 Test Administration

     Test Set A and Test Set B were both administered in 50 minutes. Senior high school students were given Test Set A. It was implemented as a reading proficiency test in a 50-minute class period, proctored by the teachers who taught the class in the regular lessons.

     For the university students, the test was administered as part of a placement test for their required English classes, which consisted of a listening comprehension section and a reading comprehension section. They were given either Test Set A or Test Set B, depending on the date on which they took the test. Those students who took the test on the first day of the placement test were given the test which included Test Set A as the reading comprehension section, and those who took the test on the second day, Test Set B. The scores on the reading comprehension section were not counted in the placement itself because of the difference in difficulty between the two test sets. In the first half of the testing time, students were given 50 items that tested their listening skills. In this part of the test, the time was regulated by the listening material. At the end of this section, which was announced by the listening material itself, students were told to begin the reading section. The students were given 50 minutes for the reading section. The test was proctored by the teachers who taught the required English classes.

     Both the high school students and the university students were asked to provide their answers on mark sheets. These mark sheets were scored electronically by a mark-sheet scanner.

4.3 Data Analysis

4.3.1 Predetermining Ability Groups


     Prior to the data analyses, three groups of different abilities were determined based on the results of the data collection described above. The three groups are: Group A-Low, Group A-High, and Group B.

     Group A-Low and Group B were to represent groups of test takers who were responding to items whose difficulty was equivalent to their reading ability, and Group A-High was to represent test takers who were responding to items considered to have a difficulty lower than their reading ability. In this way, the results of Group A-Low and Group A-High could be compared to investigate the differences exhibited by test takers with different reading abilities tackling test items of the same difficulty. Furthermore, the results of Group A-Low and Group B were to be compared to observe the differences presented by test takers with different reading abilities responding to test items whose difficulty was equivalent to their ability.

     Here, an explanation may be necessary of what is meant by "test takers with different reading abilities responding to the test items that had the difficulty equivalent to their ability" for Group A-Low and Group B, and "the test takers who were responding to the items that were considered to have the difficulty lower than their reading ability" for Group A-High. In Item Response Theory (IRT), the theory on which the calculation of item difficulty is based in the analyses of Section 5.3, the idea is to find the relationship between the difficulty of a test item, the ability of a test taker, and the probability of a test taker answering a test item correctly (Ohtomo 1996: 69). The difficulty of a test item is determined by its "item characteristic curve", a graph which is drawn after calibration using a logistic function. On this curve, the point at which the probability of a correct response is 0.50 (50%) indicates the ability level of the person whose probability of answering that test item correctly is 0.50, and that ability index is employed as the difficulty of the test item. Therefore, the index provided as "theta" in Appendices C-1, D-1, and E-1 indicates the ability level (from -3.0 to 3.0) of a person whose probability of responding to that item correctly is 0.50, and this also represents the difficulty of the test item. This relationship between the ability of a test taker and the difficulty of a test item makes it possible to characterize each subject group as having an ability that is "equivalent to" or "higher than" the difficulty level of the test items.
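     To make this concrete, under the Rasch model (the one-parameter logistic model underlying the RASCAL calibrations described in 4.3.2.3), the probability of a correct response depends only on the difference between the test taker's ability theta and the item's difficulty, and that probability is exactly 0.50 when the two are equal. A minimal sketch in Python, with a hypothetical difficulty value rather than one estimated in the present study:

import math

def rasch_probability(theta: float, b: float) -> float:
    """Rasch model: probability of a correct response for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

b = 0.8  # hypothetical item difficulty on the theta scale
for theta in (-1.0, 0.0, 0.8, 2.0):
    print(f"theta = {theta:+.1f}  ->  P(correct) = {rasch_probability(theta, b):.2f}")
# When theta equals the item difficulty (0.8), P(correct) is exactly 0.50,
# which is why the difficulty index can also be read as an ability level.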

     Originally, the present author had chosen to give Test Set A to the high school students and half of the university students, so that the high school students would represent Group A-Low and the university students Group A-High. Test Set B was given to the rest of the university students to represent Group B. At this point, the author had assumed that the university students would possess a higher ability in English reading comprehension, since they had had an extra year of English education along with their preparatory learning experience for university entrance examinations. However, this method of predetermining the ability groups did not function for the present study because virtually no difference could be found between the scores of the high school students and the university students on Test Set A; the mean scores were 17.6 for the high school students and 17.9 for the university students. One possible cause is the fact that the university students were given the reading comprehension test after they had worked on the listening comprehension section of the placement test. The cognitive load imposed on the test takers while working on the listening comprehension could have exhausted them cognitively and impeded their performance on the reading section, rendering the result above. However, when the listening test material was evaluated, it did not appear to exhibit a difficulty that would influence test takers' performance in the latter section of the test. Therefore, it was presumed that there was indeed little difference in reading ability between the high school students and the university students who were given Test Set A. For this reason, the present author decided to look at the results of the test takers who worked on Test Set A as a whole, regardless of whether they were high school students or university students, and to predetermine the ability groups based on their test scores on Test Set A. A detailed description of how these groups were decided is presented in Chapter 5. No change was made in predetermining Group B, since the university students who worked on Test Set B averaged 16.3, which showed that the test takers who were given Test Set B were advanced learners at the ability level expected to correctly respond to the test items in Test Set B.

4.3.2 Statistical Procedures

     Three statistical procedures were used in order to analyze the collected data.

4.3.2.1 Descriptive Statistics

     For each test set, the mean and standard deviation were calculated. KR-20 was used to estimate the internal consistency of each test set to ensure its reliability in measuring students' reading ability. For the purpose of test validation, the facility value (percentage correct) and the discrimination index (point-biserial correlation), calculated using Classical Test Theory by ITEMAN (Assessment Systems Corporation), were also provided.
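     For reference, KR-20 is the internal-consistency coefficient for dichotomously scored items, computed from each item's proportion correct and the variance of the total scores. A minimal sketch in Python, using a small hypothetical 0/1 response matrix rather than the study's data:

import numpy as np

def kr20(responses: np.ndarray) -> float:
    """Kuder-Richardson Formula 20 for a matrix of 0/1 responses
    (rows = test takers, columns = items)."""
    k = responses.shape[1]                    # number of items
    p = responses.mean(axis=0)                # proportion correct per item
    q = 1.0 - p
    total_var = responses.sum(axis=1).var()   # variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

# Hypothetical responses for 6 test takers on 5 items
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
])
print(f"KR-20 = {kr20(responses):.2f}")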

4.3.2.2 Factor Analytic Studies

     In an attempt to come up with a test item specification that effectively operationalizes different reading performances to be tested, the present study proposes that the "question type" of a test item could be a prime component of such a framework. In order to identify the components, or factors, that constitute L2 reading performances, factor analyses were carried out on the data collected for each test set. The nature of the factors generated is interpreted qualitatively.

     Full-information factor analysis was applied in the factor analytic studies of both test sets via TESTFACT 2 (Scientific Software International). Although some problems have been pointed out in using traditional factor analysis methods with binary data (i.e. items that are scored dichotomously as right or wrong), full-information factor analysis has been evaluated as being able to accommodate such circumstances (Negishi 1996; Bock 1984).
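     As a rough illustration of the workflow only: TESTFACT's full-information method is based on marginal maximum likelihood estimation and is not reproduced here. The sketch below instead fits an ordinary exploratory factor analysis to a simulated 0/1 response matrix, assuming the third-party Python package factor_analyzer is available; applying a linear factor analysis directly to binary data is exactly the simplification that full-information factor analysis is meant to avoid, so this is a stand-in for the general procedure rather than the method used in the study.

import numpy as np
from factor_analyzer import FactorAnalyzer  # third-party package, assumed installed

# Simulate a 0/1 response matrix with a single underlying ability (hypothetical data)
rng = np.random.default_rng(0)
theta = rng.normal(size=(200, 1))                      # latent ability per test taker
difficulty = np.linspace(-1.5, 1.5, 9)                 # hypothetical item difficulties
p_correct = 1.0 / (1.0 + np.exp(-(theta - difficulty)))
responses = (rng.random((200, 9)) < p_correct).astype(int)

# Exploratory factor analysis with two factors and an oblique rotation
fa = FactorAnalyzer(n_factors=2, rotation="promax")
fa.fit(responses)

print("Loadings (items x factors):")
print(np.round(fa.loadings_, 2))
print("Proportion of variance per factor:", np.round(fa.get_factor_variance()[1], 2))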

4.3.2.3 Item Analyses

     To discover which facets of a reading test item would allow the writers of test items to predetermine the difficulty of a test item, the present study investigates the possibility of a link between the item difficulty of a test item and its question type. For this purpose, test items are analyzed by examining their item difficulty indices, calculated via Rasch analysis using RASCAL (Assessment Systems Corporation), in relation to question type. Other information from the final parameter estimates, as well as a raw score conversion table, an item-by-person distribution map, a test characteristic curve, and a test information curve, is provided in this section of the analysis.
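     Once a difficulty index has been calibrated for every item, the comparison with question type reduces to grouping the difficulty estimates by type and inspecting their distributions. A minimal sketch of this step in Python, with illustrative difficulty values and type labels rather than those estimated in the present study:

from statistics import mean

# Hypothetical Rasch difficulty estimates (theta scale) paired with question types
items = [
    ("global-inferential", -0.4), ("local-literal", -1.1), ("local-inferential", 0.3),
    ("global-inferential",  0.1), ("local-literal", -0.7), ("local-inferential", 0.9),
    ("global-inferential", -0.2), ("local-literal", -0.9), ("local-inferential", 0.5),
]

by_type: dict[str, list[float]] = {}
for qtype, difficulty in items:
    by_type.setdefault(qtype, []).append(difficulty)

for qtype, difficulties in by_type.items():
    print(f"{qtype:20s} mean difficulty = {mean(difficulties):+.2f} (n = {len(difficulties)})")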
