

Journal of English for Academic Purposes 12 (2013) 235–247

Contents lists available at ScienceDirect

Journal of English for Academic Purposes

journal homepage: www.elsevier.com/locate/jeap

Developing the Academic Collocation List (ACL) – A corpus-driven and expert-judged approach

Kirsten Ackermann, Yu-Hua Chen*

Pearson, 80 Strand, London, WC2R 0RL, UK

Keywords: Collocation; Corpus driven; Expert judgment; Vocabulary list; EAP

* Corresponding author.
E-mail addresses: [email protected] (K. Ackermann), [email protected] (Y.-H. Chen).

1475-1585/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.jeap.2013.08.002

Abstract

This article describes the development and evaluation of the Academic Collocation List (ACL), which was compiled from the written curricular component of the Pearson International Corpus of Academic English (PICAE) comprising over 25 million words. The development involved four stages: (1) computational analysis; (2) refinement of the data-driven list based on quantitative and qualitative parameters; (3) expert review; and (4) systematization. While taking advantage of statistical information to help identify and prioritize the corpus-derived collocational items that traditional manual examination is unable to manage, we argue that only with human intervention can a data-driven collocation listing be of much pedagogical use. Focusing on lexical collocations only, we present a new Academic Collocation List compiled using a mixed-method approach of corpus statistics and expert judgement, consisting of the 2,468 most frequent and pedagogically relevant entries we believe can be immediately operationalized by EAP teachers and students. By highlighting the most important cross-disciplinary collocations, the ACL can help learners increase their collocational competence and thus their proficiency in academic English. The ACL can also support EAP teachers in their lesson planning and provide a research tool for investigating academic language development.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

The Academic Word List (Coxhead, 2000) is arguably the most widely used EAP word list nowadays taking corpus frequency into account. As the research interest in corpus linguistics has gradually shifted towards word co-occurrence rather than single words (see Granger & Meunier, 2008; Schmitt, 2004; Stubbs, 2001; Wray, 2008), there have been more investigations of recurrent word combinations in academic prose using frequency and dispersion parameters (e.g. Biber, Conrad, & Cortes, 2004; Chen & Baker, 2010; Hyland, 2008). To the best of our knowledge, however, corpus analysis was not applied to creating lists of multi-word units for EAP pedagogy until recently when Durrant (2009) looked into the viability of an academic collocation list and Simpson-Vlach and Ellis (2010) an academic formulas list (AFL). Both lists present corpus-derived lexis across academic disciplines. The former covers the most frequent two-word collocations retrieved using statistical information to determine the strength of word co-occurrence while the latter combines automated extraction of recurring word sequences and expert judgement to identify pedagogically useful formulaic sequences (3-, 4-, and 5-grams) for EAP.

In contrast to the corpus-driven approach described above, the traditional approach in collocation research often relies on expert intuition to identify phraseological units and is hence also known as the corpus-based approach if corpus data are used (see Granger & Paquot, 2008; Tognini-Bonelli, 2001). In conventional phraseology research (Cowie, 1981, 1994; Howarth, 1998), collocation is generally seen as a continuum with varying degree of arbitrary restriction ranging from free combinations (e.g. write an essay), through restrictive collocations (e.g. conduct/do research instead of make research), to frozen expressions (e.g. generally speaking). The arbitrary restriction could be associated with semantic opaqueness, degree of formulaicity or substitutability. However, even in free collocations, no single word can genuinely co-occur with any other word without restriction. Take write an essay for example. In terms of syntax, if functioning as a transitive verb, write can only be followed by a noun as an object. In terms of semantics, the literal meaning of write can only collocate with what the action of write can produce such as essay, song or article. In other words, the traditionally defined ‘free collocations’ are still very much restricted by their semantic and/or syntactic environment. Hence, the boundaries between free combinations and restrictive collocations are sometimes blurred and thus difficult to distinguish as both of them entail arbitrary restriction to various extents. This also corresponds to the theory of ‘Lexical Priming’ proposed by Hoey (2005). Hoey argued that collocations offer a clue of how language is structured and that words are ‘primed’ for use through our repeated encounters with them so that our knowledge of a word, including the contexts and co-texts in which they occur, is the product of such encounters.

Although collocations can be instantly recognized by native speakers, they often remain difficult for learners to acquire and use properly. According to Nation (2001, p. 324) collocations contain ‘some element of grammatical or lexical unpredictability or inflexibility’. It is this feature that makes collocations challenging for L2 learners. Laufer (2011, pp. 30–31) summarizes the research findings from error analyses, elicitation and corpus analyses as follows: ‘the use of collocations is problematic for L2 learners, regardless of years of instruction they received in L2, their native language, or type of task they are asked to perform.’ Particularly the productive use of collocations poses great challenges to L2 learners. Biber and Conrad (1999) found that words with similar meaning are often distinguished by their preferred collocations, which reinforces the need for a high level of collocational competence if speakers want to express themselves clearly and unambiguously. Bahns and Eldaw’s study (1993), on the other hand, revealed that in a translation task the number of ‘collocation errors’ made by L2 speakers is twice as high as the errors in single lexical items. Biskup (1992) showed that learners use inappropriate synonyms when producing collocations, and Nesselhauf (2005) points out that 50% of collocation errors are due to mother tongue interference. Finally, Cobb (2003) found that even when learners use collocations correctly they over-rely on a small number of collocations. Yet by using a less appropriate collocate, a non-native speaker will sound unnatural or may even become unintelligible among speakers of the target language. Hence if learners aim for advanced proficiency, achieving a high level of collocational competence is essential.

The aforementioned research reveals the importance collocations play as well as the challenges they pose to L2 learners, thus indicating the central role they should play in language teaching and learning. Nation (2001, pp. 189–191) highlights the fact that academic collocations may ‘neither be sufficiently frequent in the language as a whole to be learnt implicitly nor part of the technical lexicon which is likely to be explicitly taught as part of subject courses’, which reinforces the need for such a listing. None of the existing EAP vocabulary lists, however, has met this need. As Durrant himself points out (2009, p. 163), the majority of the items in his list are grammatical collocations, i.e. one closed-class word (a.k.a. function word) such as prepositions or determiners plus one open-class word (a.k.a. content word) such as verbs or nouns (Benson, 1985). For example, the top five collocations in Durrant’s listing (2009, p. 166) – this study, associated with, based on, and respectively, due to – do not appear to have attracted much research interest in conventional collocation studies, where collocations are manually retrieved as opposed to relying on statistical measures. Instead, in the traditional approach, lexical collocations (the combination of two open-class components, e.g. perform an experiment) are usually the target phraseological units under investigation (e.g. Granger, 1998; Laufer & Waldman, 2011; Nesselhauf, 2003). While it is true that some of the grammatical collocations in Durrant’s listing could contribute to revealing the patterns that might be overlooked otherwise, his listing, based on statistics alone, does not provide readily usable materials for EAP teaching and learning.

In the current study, we therefore argue that only with human intervention can a data-driven collocation listing be of much pedagogical use while still taking advantage of statistical information to help identify and prioritize the corpus-derived collocational items that traditional manual examination is unable to manage. This is similar to the mixed-method approach adopted by Simpson-Vlach and Ellis (2010), who also combined statistical information and human judgement from EAP instructors when compiling the Academic Formulas List. The difference is that in our study expert judgment is used not only for the selection of lexical items for pedagogical purposes but also for the refinement for the final listing. It should be noted that manual intervention is perhaps much more challenging when tackling collocations than it is when listing formulas because the latter are rather fixed expressions (e.g. in terms of, at the same time, from the point of) with little variation of individual components. Collocations, on the other hand, often contain inflective or positional variations (e.g. results obtained, broader contexts, achieving objectives), which poses the great challenge of how to collate these relevant forms and present them in a uniform and consistent way. This challenge can only be met with human intervention as there is currently no automation that can simplify this process.

In terms of disciplinary variability, Hyland and Tse (2007) cast some doubt on the generalizability of the AWL and questioned the assumption of a universal single core vocabulary list that can be applied to all fields of study. Although we also take a general approach following the tradition of the AWL, a specified frequency and dispersion threshold at least warrants that students, regardless of field of study, would be more likely to encounter the lexical items on the lists than those outside of the lists. In addition, EAP lessons usually follow a general syllabus to cater for students of all subjects; thus there is a need for such cross-disciplinary lexical resources. Focusing on lexical collocations only, we present an Academic Collocation List (ACL), which consists of the 2,468 most frequent (determined by corpus analysis) and pedagogically relevant (determined by expert judgement) entries we believe can be immediately operationalized by EAP teachers and students.

Table 1
Fields of study and academic disciplines represented in the written curricular component of the corpus.

Applied sciences and professions (AS)   Humanities (HM)                       Social sciences (SS)                Natural/Formal sciences (NS)
Discipline        Tokens                Discipline          Tokens            Discipline        Tokens            Discipline          Tokens
Architecture      167,074               History             946,707           Anthropology      413,237           Earth sciences      1,343,723
Business          1,644,180             Linguistics         855,128           Archaeology       184,089           Chemistry           1,502,277
Education         405,202               Literature          1,562,046         Cultural studies  861,656           Physics             662,054
Engineering       1,134,950             Arts                728,532           Gender studies    520,395           Computer sciences   1,124,097
Health sciences   1,429,679             General humanities  627,951           Politics          1,090,800         Mathematics         295,565
Media studies     1,500,485             Philosophy          602,233           Psychology        1,560,745         Biology             858,597
Law               1,962,002             Religion            198,165           Sociology         1,832,588         Ecology             239,787
Total             8,243,572 (31%)       Total               5,520,762 (21%)   Total             6,463,510 (25%)   Total               6,026,100 (23%)

2. Methodology

2.1. Corpus

The Academic Collocation List is derived from the written curricular component of the Pearson International Corpus of Academic English (PICAE). The corpus comprises over 37 million words of academic written and spoken texts from five major English-speaking countries, i.e. Australia, Canada, New Zealand, UK and USA. The corpus includes curricular English as found in lectures, seminars, textbooks and journal papers. It also samples extracurricular English that students encounter from university administration to transcripts of broadcasts.

The written curricular component of the corpus, from which the ACL was compiled, comprises 25.6 million words from journal articles and textbook chapters covering 28 academic disciplines. Each of the four fields of study contains materials from seven academic disciplines to ensure that the corpus is representative of the academic register. The number of tokens per academic discipline as well as the total number of tokens per field of study and its percentage are provided in Table 1. Whereas the main objective was to compile subcorpora for each field of study of similar size, less emphasis was placed on having similar-sized subcorpora for each academic discipline.

The ACL was developed in four stages. First, a computational analysis of the written curricular component was conducted. Second, manual refinement of the data-driven list based on quantitative parameters and target part-of-speech combinations was carried out. This was followed by an expert review to judge whether each collocation is pedagogically relevant and a systematization process of the list. Each stage will be addressed in turn in the following sections.

2.2. Computational analysis

At this stage ‘collocation’ was defined as a single word that tends to co-occur in the span of ±3 words from the reference word, co-occurring at least five times in total across at least five different texts with a Mutual Information (MI) score of at least 3 and a t-score of at least 2. The MI score indicates the strength of association between the components of the collocation. The t-score, on the other hand, is a measure of certainty of a collocation, also taking frequency into account. The former is more likely to give high scores to fixed phrases whereas the latter will yield significant collocates that occur relatively frequently. According to Hunston (2002, p. 75), a collocation with an MI score of at least 3 and a t-score of at least 2 is considered ‘a strong collocate, and a certain one’.
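The formulas behind the two association measures are not spelled out here. The following is a minimal sketch, assuming the formulations commonly used in corpus linguistics: MI as the base-2 logarithm of observed over expected co-occurrences, and the t-score as observed minus expected, divided by the square root of observed. The function names and the simple independence estimate of the expected frequency are illustrative assumptions and not the collocation program written for this project.

```python
import math


def expected_cooccurrences(node_freq, collocate_freq, corpus_size, span=1):
    """Expected co-occurrence count if node and collocate were independent.

    Assumption: the simple estimate f(node) * f(collocate) * span / N.
    """
    return node_freq * collocate_freq * span / corpus_size


def mi_score(observed, expected):
    """Mutual Information: strength of association between the two words."""
    return math.log2(observed / expected)


def t_score(observed, expected):
    """t-score: certainty of the association, sensitive to raw frequency."""
    return (observed - expected) / math.sqrt(observed)


def is_candidate(observed, expected, n_texts,
                 min_freq=5, min_texts=5, min_mi=3.0, min_t=2.0):
    """Apply the thresholds quoted in the text: frequency >= 5,
    texts >= 5, MI >= 3 and t-score >= 2."""
    if observed < min_freq or n_texts < min_texts:
        return False
    return (mi_score(observed, expected) >= min_mi
            and t_score(observed, expected) >= min_t)
```

Under these assumptions a pair with a high observed-to-expected ratio but very few occurrences scores well on MI and poorly on the t-score, which is why the two measures are used together.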

The first step of the computational analysis was to obtain a list of content words in the corpus using MonoConc Pro 2.2. Secondly, a list of node words, i.e. the words that occurred at least five times per million words and in at least five different texts, was compiled. Function words, proper nouns, personal names and non-words were removed manually from this list if they occurred in high frequency. Words from the General Service List (West, 1953) were also removed from the node word list but could appear as pre- or post-collocate.

Next, a stop list which contained frequent function words that express little lexical meaning was created, i.e. articles, pronouns, conjunctions, preposition of.¹ The stop list was used by the collocation program, specifically written for this project, to exclude sequences composed only of grammatical function words from subsequent analysis.

¹ As the focus of the ACL is lexical collocation, the preposition ‘of’ was excluded because it tends to occur in grammatical phrases only (e.g. of the).

The list of node words was then used to extract potential collocations from the corpus. In total, this data-driven list contains over 130,000 entries. Each entry includes the node word, its collocate, the general position of the collocate (pre- or post-), the precise position the collocate most often occurs in, raw frequency, MI score, t-score, the number of texts the collocation occurs in, and normed rates of occurrence for each of the four fields of study defined in the corpus, i.e. applied science & professions (AS), humanities (HM), social sciences (SS), and natural/formal sciences (NS). Table 2 provides a sample output from the computational analysis.

Table 2
Sample output from the computational analysis.

                                                                                Normed frequency in field of study
Pre-collocate   Academic word  Post-collocate  Position  No of texts  Raw freq  Per mill  AS     HM     SS     NS     MI    t-score
becomes         apparent                       −1        46           66        2.96      3.28   2.72   3.25   2.41   7.72  8.09
more            complex                        −1        208          870       39.04     38.69  37.10  39.74  40.63  5.76  28.95
social          context                        −1        73           350       15.71     18.13  15.72  25.65  1.21   5.03  18.13
contribute xxx  development                    −3        17           25        1.15      1.28   0.42   2.35   0.20   4.80  4.82
                domain         specific        1         9            57        2.56      0.57   9.22   0.54   1.21   6.68  7.48
                empirical      research        1         51           108       4.85      6.14   2.72   8.67   0.80   6.81  10.30
well            established                    −1        131          319       14.32     12.13  16.98  14.09  15.08  6.43  17.65
                experimental   study           1         21           30        1.35      1.86   1.26   1.08   1.01   4.44  5.23
                holistic       approach        1         26           45        2.02      2.28   0.63   3.43   1.41   8.65  6.69
profound        implications                   −1        16           22        0.99      0.71   0.21   1.99   1.01   8.58  4.68
provide         information                    −1        109          371       16.65     20.41  14.88  13.01  17.10  5.91  18.94
seem            paradoxical                    −1        5            5         0.22      0.29   0.21   0.36   0.00   7.34  2.22
                subordinate    position        1         15           27        1.21      0.14   1.68   3.25   0.00   7.26  5.16

Table 2 also highlights why further refinement was required as the entries may have a very low normed frequency, e.g. seem paradoxical; may have a low distribution in certain fields of study, e.g. subordinate position; or may be part of an extended phrase, e.g. contribute xxx development.

2.3. Refinement

This list of 130,000+ entries required further scrutiny in order to select and present academic collocations in a more systematic and user-friendly way. This section will explain the refinement process using quantitative and qualitative parameters.

2.3.1. Filtering

As one of the main objectives of this project was to identify the most frequent collocations across academic disciplines, quantitative values were first taken into consideration. An explorative pilot investigation was conducted in search for the optimal combination of cut-off points of MI score, t-score, frequency and distribution, with which the collocations could be identified while unsuitable combinations could be filtered out. As a result the two principal researchers agreed that only entries which met the following quantitative parameters would undergo further analysis: (1) normed frequency ≥1 per million; (2) normed frequency ≥0.2 per million in each field of study; (3) MI score ≥3; and (4) t-score ≥4. The t-score threshold was raised because it was found that entries with a t-score of less than 4 were mainly noun-preposition combinations and fragments of extended phrases, which were not target combinations. Once the entries were filtered using the quantitative parameters (1) to (4), the resulting list was reduced to 16,174 entries. Despite a much more manageable data set,² this list still required further refinement.
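The four quantitative parameters can be sketched as a simple filter. This is an illustrative sketch only: the `Entry` class, its field names and the function name are assumptions introduced here, while the values are rows taken from the sample output in Table 2.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    collocation: str
    per_million: float   # overall normed frequency
    by_field: dict       # normed frequency per field of study (AS, HM, SS, NS)
    mi: float
    t: float


def passes_filter(entry, min_freq=1.0, min_field_freq=0.2,
                  min_mi=3.0, min_t=4.0):
    """Parameters (1) to (4): overall frequency, frequency in every
    field of study, MI score and t-score."""
    return (entry.per_million >= min_freq
            and all(v >= min_field_freq for v in entry.by_field.values())
            and entry.mi >= min_mi
            and entry.t >= min_t)


# Rows copied from the sample output in Table 2.
SAMPLE = [
    Entry("more complex", 39.04,
          {"AS": 38.69, "HM": 37.10, "SS": 39.74, "NS": 40.63}, 5.76, 28.95),
    Entry("empirical research", 4.85,
          {"AS": 6.14, "HM": 2.72, "SS": 8.67, "NS": 0.80}, 6.81, 10.30),
    Entry("profound implications", 0.99,
          {"AS": 0.71, "HM": 0.21, "SS": 1.99, "NS": 1.01}, 8.58, 4.68),
    Entry("seem paradoxical", 0.22,
          {"AS": 0.29, "HM": 0.21, "SS": 0.36, "NS": 0.00}, 7.34, 2.22),
    Entry("subordinate position", 1.21,
          {"AS": 0.14, "HM": 1.68, "SS": 3.25, "NS": 0.00}, 7.26, 5.16),
]

kept = [e.collocation for e in SAMPLE if passes_filter(e)]
print(kept)
```

Of these five rows, only more complex and empirical research survive: profound implications and seem paradoxical fall below 1 per million overall, and subordinate position falls below 0.2 per million in two fields of study.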

² Here the manageability refers to the data management which would be subject to human judgement at a later stage as well as the learning load for students. Cf. The AWL consists of 570 word families and approximately 3,000 words altogether while the AFL presents 200 formulas for the spoken and the written registers respectively. Other comparable pedagogical vocabulary listings include the General Service List with the most frequent 2,000 English word families (West, 1953), the Phrasal Expression List (Martinez & Schmitt, 2012) with 505 entries, and only the top 100 key academic collocations – mostly grammatical ones – reported in Durrant’s study (2009).

2.3.2. POS-tagging

At this stage, it was decided to apply part-of-speech tagging to each entry to facilitate the extraction of collocations with specific word–class combinations. Lexical collocations that fall into the following four types of part-of-speech (POS) combinations are the major targets of our subsequent investigation: verb + noun (e.g. gather data), adjective + noun (e.g. systematic approach), adverb + adjective (e.g. increasingly complex), and adverb + verb (e.g. significantly affect). This conforms to the literature of conventional corpus-based collocation research, e.g. verb + noun combinations investigated by Altenberg and Granger (2001), Laufer and Waldman (2011), Nesselhauf (2005), or intensifying adjectival combinations by Lorenz (1998). Some target POS combinations may meet all the quantitative criteria but are excluded from the list because they are of little pedagogical value. For example, standard deviation is highly frequent with high MI and t-scores, but this word combination is often considered a compound noun without any room for commutability and listed as an independent entry in many dictionaries we consulted. By the same token, frozen expressions such as generally speaking are excluded as they are generally regarded as formulae as opposed to collocations by phraseologists. In other words, only free and restricted combinations in the traditional phraseological sense as discussed in the collocation continuum above will be subject to expert judgement as these combinations are most challenging for learners due to their varying degree of arbitrary restriction or substitutability.

In order to only subject collocations with the target POS combinations to further analysis, the list was tagged using Apache OpenNLP v1.5.0 applying a simplified set of POS tags, i.e. noun, verb, modal verb, adverb, adjective, preposition, determiner, pronoun, conjunction. Although the entries that were tagged lacked context, the tagging was rather accurate, and only about 10 per cent of POS tags had to be corrected manually. Entries with most non-target POS combinations, for example, determiner + noun (e.g. some historians), preposition + noun (e.g. within cultures) or preposition + adverb (e.g. without actually) were excluded, whereas noun + noun combinations (e.g. information retrieval, problem area) and noun + adjective (e.g. evidence available) were kept for manual review as they appeared valuable from a pedagogical point of view although they do not fall into the four major target categories of POS combinations. The filtered list now contained 6,808 entries. These entries were first manually vetted by the principal researchers before being reviewed by experts as described below.
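Once each entry carries a pair of simplified POS tags, the selection step can be sketched as a lookup. This sketch assumes the tags have already been assigned and corrected; it does not call the tagger itself, and the names `TARGET`, `KEPT_FOR_REVIEW` and `classify` are hypothetical. The example entries and their tag pairs are those quoted in the text.

```python
# The four major target combinations named in the text.
TARGET = {
    ("verb", "noun"),
    ("adjective", "noun"),
    ("adverb", "adjective"),
    ("adverb", "verb"),
}

# Non-target combinations that were nevertheless kept for manual review.
KEPT_FOR_REVIEW = {
    ("noun", "noun"),
    ("noun", "adjective"),
}


def classify(tags):
    """Return 'target', 'manual review' or 'exclude' for a POS tag pair."""
    if tags in TARGET:
        return "target"
    if tags in KEPT_FOR_REVIEW:
        return "manual review"
    return "exclude"


# Examples quoted in the text, with the tag pairs given there.
EXAMPLES = {
    "gather data": ("verb", "noun"),
    "systematic approach": ("adjective", "noun"),
    "increasingly complex": ("adverb", "adjective"),
    "significantly affect": ("adverb", "verb"),
    "information retrieval": ("noun", "noun"),
    "evidence available": ("noun", "adjective"),
    "some historians": ("determiner", "noun"),
    "within cultures": ("preposition", "noun"),
    "without actually": ("preposition", "adverb"),
}

for entry, tags in EXAMPLES.items():
    print(f"{entry}: {classify(tags)}")
```

The text states that entries with most, not all, non-target combinations were excluded, so the final `exclude` branch is a simplification.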

2.3.3. Manual vetting

The 6,808 entries underwent a qualitative review in which each entry was assessed independently by the two researchers to determine whether a specific entry should be included, discussed or excluded from further analysis. The objective of this stage was to further refine the list by excluding the following types of entries:

1. Linguistically incomplete units (e.g. based approach³)
2. Combinations with a high degree of fixedness⁴ (e.g. collective bargaining)
3. Combinations with adverbs referring to time or frequency (i.e. already, now, often)
4. Combinations with common transparent adjectives (e.g. good evidence)
5. Combinations with concrete geographical references (e.g. European community)
6. Combinations that are often hyphenated (e.g. ill-defined)

The independent judgements were then compared and entries where there was no agreement or that were marked as ‘discuss’ were reassessed. The discussion was mainly related to entries where there was ambiguity in relation to the degree of fixedness, technical specificity or semantic transparency of an entry. At this stage it was decided to opt for inclusion if no clear-cut decision could be made.

After excluding all combinations tagged as ‘exclude’ by both researchers, the remaining 4,558 entries were subjected to the expert review.
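The comparison of the two independent judgements can be sketched as follows. This is a hedged illustration of the decision logic as described, not the researchers' actual procedure: the function names and the `after_discussion` argument are hypothetical, and the outcome of a discussion is of course a human decision that code cannot supply.

```python
VALID = {"include", "discuss", "exclude"}


def reconcile(first, second):
    """Compare two independent judgements on one entry."""
    if first not in VALID or second not in VALID:
        raise ValueError("judgement must be 'include', 'discuss' or 'exclude'")
    if first == second == "exclude":
        return "exclude"
    if first == second == "include":
        return "include"
    # Disagreement, or at least one 'discuss': the entry is reassessed.
    return "reassess"


def final_decision(first, second, after_discussion=None):
    """Resolve an entry; default to inclusion when reassessment
    yields no clear-cut decision (after_discussion is None)."""
    outcome = reconcile(first, second)
    if outcome != "reassess":
        return outcome
    return after_discussion or "include"
```

For example, `final_decision("include", "exclude")` returns `"include"`, reflecting the stated preference for inclusion when no clear-cut decision could be made.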

2.4. Expert review

The purpose of the expert review was to judge whether all 4,558 entries, which met the aforementioned quantitative and qualitative criteria, should be included in the final list from a pedagogical point of view, i.e. appropriateness and relevance of each entry to the field of EAP. The panel consisted of six experts from different professional backgrounds as below, and the rationale of having theoretically oriented as well as practically oriented experts on the panel was to ensure a thorough assessment of the list entries.

Expert 1: Professor of Linguistics
Expert 2: Professor of English Linguistics
Expert 3: Senior lecturer in EFL/TESOL
Expert 4: Dictionary consultant
Expert 5: Professor of English Language and Literature
Expert 6: Lexicographer and publisher

A detailed written brief that outlined the scope and objectives of the project in general and the aims of the ACL in particular was reviewed and approved by a university EAP lecturer and a publisher from the panel before being sent to all the panel experts. The questions and rating scale given to the experts were deliberately vaguely formulated to allow for the incorporation of individual viewpoints deriving from the varied backgrounds. However, experts were instructed to contact the two principal researchers for any clarification before commencing judgement and were encouraged to comment on their own ratings. Each expert was asked to make an independent judgement based on the following questions: (1) Is it appropriate to regard the entry as a collocation for teaching and/or learning purposes? (2) Is the collocation pedagogically relevant? The following four-point Likert scale was used for the judgement:

1 = definitely exclude
2 = not sure, but tendency to exclude
3 = not sure, but tendency to include
4 = definitely include

³ This combination takes one noun before ‘based’, e.g. ‘task based approach’.
⁴ The degree of fixedness was determined by consulting several popular online dictionaries including the Longman Dictionary of Contemporary English (http://www.ldoceonline.com/) or the Cambridge Dictionary (http://dictionary.cambridge.org/) to see whether the word combinations under investigation are listed as independent entries.

Each entry contained the following statistical information: overall normed frequency, normed frequency in each field of study, MI score, and t-score. Experts could use the statistical information to make an informed decision when assessing the collocation.

The inter-rater reliability was overall moderate (Intraclass Correlation Coefficient 0.524). Experts⁵ agreed to definitely include 1,215 collocations (27%). The moderate agreement is probably not surprising as the reviewers were intentionally chosen from heterogeneous backgrounds so that they could provide feedback from different perspectives. If the sum of the expert judgements was less than or equal to 9, the entry was excluded from the final list. In total 385 entries (8%) were dismissed at this stage.
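The exclusion rule can be sketched as a sum over the six ratings. The constant and function names are assumptions for illustration, and the ratings in the usage example are invented to show the boundary case; they are not data from the study.

```python
N_EXPERTS = 6           # size of the expert panel
SCALE = range(1, 5)     # four-point scale: 1 (exclude) to 4 (include)
EXCLUSION_SUM = 9       # entries with a rating sum <= 9 were excluded


def keep_entry(ratings):
    """Return True if the entry survives the expert review."""
    if len(ratings) != N_EXPERTS or any(r not in SCALE for r in ratings):
        raise ValueError("expected six ratings, each between 1 and 4")
    return sum(ratings) > EXCLUSION_SUM


# Hypothetical ratings illustrating the boundary.
print(keep_entry([1, 2, 2, 1, 2, 1]))  # sum 9: excluded
print(keep_entry([2, 2, 2, 1, 2, 1]))  # sum 10: kept
```

With six raters the possible sums run from 6 to 24, so a cut-off of 9 removes only entries whose average rating is 1.5 or lower, that is, entries that most experts leaned clearly towards excluding.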

It has to be noted that the comments from the panel also significantly contributed to the follow-up systematization process, which was undertaken by the principal researchers. This process is explained in the next section.

2.5. Systematization

As more than one panel member suggested systematizing the entries to provide a listing that would be more readilyaccessible for users, the researchers decided to take the following steps as part of the systematization process:

I. Listing collocations in their base form
   i. Changing adjectives in the comparatives/superlatives to the base form if appropriate
   ii. Changing nouns in plural to singular unless the noun is a plurale tantum or more likely to appear in its plural form as part of the collocation
   iii. Changing inflected verbs to infinitive
II. Harmonizing entries that appear in British and American English, with British English as the preferred form
III. Adding definite or indefinite articles to verb + noun collocations in line with dictionary conventions (e.g. resolve (a) dispute; apply (the) theory)
IV. Adding optional copula be to adverb + verb past participle combinations if they can be used as predicate (e.g. (be) universally accepted)
V. Adding dominant prepositions to collocations (e.g. bear resemblance (to))

In order to make informed decisions, the researchers consulted concordance lines from the corpus and referred back to the data-driven list. The pros and cons regarding points III to V are discussed later.

3. Results

In this section, an overview of the ACL in terms of part-of-speech combinations is provided and the results of the validation study are presented.

3.1. Composition

After several stages of computational analysis and manual refinement as described above, the Academic Collocation List with 2,468 entries was completed. The number of entries in the various target POS combinations and examples in each category are presented in Table 3. As can be seen, noun combinations form the largest category, comprising nearly three quarters of the total entries (74.3%, n = 1,835). The second largest category consists of verb combinations with nouns or adjectives as complements (13.8%, n = 340), followed by adverb + verb combinations (6.9%, n = 170), which include positionally variable adverb + verb and verb + adverb entries as well as adverb + vpp (verb past participle) combinations. Adverb + adjective combinations are comparatively fewer but still cover 5.0% (n = 124) of the list.

A representative selection of adjective + noun and noun + noun combinations in the ACL is listed in Table 4. These are the entries that received the highest expert agreement (see Section 2.4), indicating almost unanimous agreement among experts that these collocations are of pedagogical value.

In terms of verb combinations, again only those which received the greatest expert agreement are listed here. Verb + noun and verb + adjective collocations are presented in Table 5, and verb + adverb collocations with variable sequences in Table 6. In comparison with nominal collocations, verb collocations appear to have received much more research interest in the past, particularly verb + noun collocations in second language learning. The assumption appears to be that learners tend to encounter more difficulties in choosing verb collocates correctly than with any other type of collocation. Very little research has

5 The ratings of one expert had to be disregarded as they were incomplete.


Table 3. Overview of final academic collocations in part-of-speech combinations.

Combinations                     POS I  POS II  No of entries  Percentage  Examples
1. Noun combinations             adj    n       1773           71.8%       anecdotal evidence, classic example
                                 n      n       62             2.5%        assessment process, target audience
                                 Total          1835           74.3%
2. Verb + noun/adj combinations  v      n       310            12.6%       gather information, undertake research
                                 v      adj     30             1.2%        consider appropriate, seem plausible
                                 Total          340            13.8%
3. Verb + adv combinations       adv    v       16             0.6%        explicitly state, strongly agree
                                 v      adv     29             1.2%        grow rapidly, vary considerably
                                 adv    vpp     124            5.0%        previously mentioned, (be) widely dispersed
                                 Total          170            6.9%
4. Adv + adj combinations        adv    adj     124            5.0%        highly controversial, (be) markedly different
                                 Total          124            5.0%
Grand total                                     2468           100%


addressed other POS combinations, particularly noun combinations, although they dominate the academic register. This discrepancy, therefore, may require further research into learners’ collocational use to find out whether other types of collocations, in comparison with verb + noun collocations, also pose challenges to learners.

The final category contains adverb + adjective combinations. Those with the highest expert agreement are presented in Table 7. It is interesting to note that across Tables 4–7 there appears to be a range of recurring words, many of which originate from the same word families, e.g. increase/increasing/increasingly, significant/significantly. Take the most frequent adverb in the ACL, highly, for example. It collocates with 20 different adjectives (e.g. sophisticated, complex, critical) and six verb past participles (e.g. educated, charged, developed). As for its adjective form high, it collocates with 22 nouns (e.g. level, profile), while its comparative form higher collocates with two nouns (education, degree).

The implication for EAP pedagogy at the lexico-grammatical level is that despite a seemingly large number of collocations in the ACL, the actual teaching and learning load should be manageable as many of the words in the ACL are often part of a ‘recurrent frame’, and understanding the frame will give learners a sense of familiarity when encountering new collocations within the same or a similar frame. Therefore, one possible way of presenting the entries in the classroom would be to show the frame, if any, of a node word, e.g. (be) highly + vpp, and introduce the six verb past participles from the ACL together.
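As an illustration of how entries might be grouped into such frames for classroom presentation, the following minimal sketch collects entries that share a node word and POS pattern. The tuple layout and the function name are assumptions made for this example, and only the collocates quoted in the text are used, not the full set of ACL entries.

```python
from collections import defaultdict

# A few ACL-style entries quoted in the text: (first word, POS I, second word, POS II).
ENTRIES = [
    ("highly", "adv", "sophisticated", "adj"),
    ("highly", "adv", "educated", "vpp"),
    ("highly", "adv", "complex", "adj"),
    ("highly", "adv", "charged", "vpp"),
    ("highly", "adv", "critical", "adj"),
    ("highly", "adv", "developed", "vpp"),
    ("high", "adj", "level", "n"),
    ("high", "adj", "profile", "n"),
]


def group_by_frame(entries):
    """Group second-position words under a (node word, POS I, POS II) frame."""
    frames = defaultdict(list)
    for first, pos1, second, pos2 in entries:
        frames[(first, pos1, pos2)].append(second)
    return dict(frames)


frames = group_by_frame(ENTRIES)
for (node, pos1, pos2), fillers in frames.items():
    print(f"{node} ({pos1}) + {pos2}: {', '.join(fillers)}")
```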

3.2. Validation

In order to determine whether the ACL is representative of the academic register in comparison with non-academic registers, a validation study was conducted with the aim of investigating the list’s overall coverage of its source corpus as well as of a comparably sized general corpus. The general corpus of 25 million tokens was compiled from the BNC, including imaginative writings (i.e. literary and creative works) and informative writings (i.e. leisure component).

For the purpose of the analysis, each ACL entry underwent inflection expansion. In other words, the entry was classed as the root form with all possible variants added. For example, ‘close | adj | relationship | n’ expands to the following format:

close | adj | relationship | n
closer | adj | relationship | n
closest | adj | relationship | n
close | adj | relationships | n
closer | adj | relationships | n
closest | adj | relationships | n
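The expansion above can be sketched in a few lines. The variant table is a hand-written, hypothetical stand-in for whatever morphological resource was used in the validation study, and the pipe character is assumed as the field separator.

```python
# Hypothetical lookup of inflectional variants, keyed by (root form, POS).
VARIANTS = {
    ("close", "adj"): ["close", "closer", "closest"],
    ("relationship", "n"): ["relationship", "relationships"],
}


def expand_entry(word1, pos1, word2, pos2):
    """Return one row per combination of the variants of both components."""
    forms1 = VARIANTS.get((word1, pos1), [word1])
    forms2 = VARIANTS.get((word2, pos2), [word2])
    # Vary the first component fastest, which reproduces the row order shown above.
    return [f"{a} | {pos1} | {b} | {pos2}" for b in forms2 for a in forms1]


rows = expand_entry("close", "adj", "relationship", "n")
for row in rows:
    print(row)
```

Components without an entry in the lookup are passed through unchanged, so the sketch degrades to the root form rather than failing.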

If the first collocate contained optional words, e.g. test (a) | v | theory | n, then this was expanded into two separate rows as shown below. The POS for the second row was classed as ngram, as multiword items are not tagged with a single POS. When searching for such multiword items, the POS of the constituent tokens was not considered.

test | v | theory | n
test a | ngram | theory | n
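A minimal sketch of this second expansion step is given below. It assumes that optional words are marked by parentheses inside the first component, as in the example, and the function name is hypothetical.

```python
import re

# Matches a parenthesised optional word together with any preceding whitespace.
OPTIONAL = re.compile(r"\s*\(([^)]+)\)")


def expand_optional(first, pos1, second, pos2):
    """Split an entry with an optional word into a bare row and an 'ngram' row."""
    if not OPTIONAL.search(first):
        return [(first, pos1, second, pos2)]
    without_optional = OPTIONAL.sub("", first).strip()
    # Drop the parentheses but keep the optional word, then normalise spacing.
    with_optional = " ".join(re.sub(r"[()]", "", first).split())
    return [
        (without_optional, pos1, second, pos2),
        (with_optional, "ngram", second, pos2),
    ]


expanded = expand_optional("test (a)", "v", "theory", "n")
for row in expanded:
    print(" | ".join(row))
```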

This extended list, however, included neither collocations with flexible positions of their components, e.g. conduct research | research conducted or significantly correlated | correlate significantly, nor collocations with variable gaps, e.g. consider these/


Table 4. Adjective + noun and noun + noun collocations with the highest combined score.

                                  Normed
                                freq per   No of  Normed  Normed  Normed  Normed
No. Collocation                  million   texts      AS      HM      SS      NS      MI StDev a t-score   StDev
Adjective + noun
1   academic writing               16.74      23    5.28   68.12    1.81    0.20    8.32       -   19.25       -
2   brief overview                  2.20      38    2.43    1.68    2.71    1.81    9.85       -    6.99       -
3   causal link                     1.53      16    3.14    0.63    1.45    0.20    8.09       -    5.81       -
4   conflicting interests           1.21      13    1.00    0.21    3.07    0.40    8.47       -    5.18       -
5   conventional wisdom             3.72      38    2.86    1.26    9.39    1.01   10.99       -    9.11       -
6   crucial factor                  1.80      27    2.00    1.89    2.17    1.01    7.32       -    6.29       -
7   crucial importance              2.06      35    1.71    2.10    2.71    1.81    6.99       -    6.73       -
8   crucial role                    4.89      71    3.43    7.55    5.42    3.82    7.11       -   10.36       -
9   cultural heritage               2.51      30    2.00    1.05    3.97    3.02    8.09       -    7.46       -
10  (a) deep understanding (of)     2.74      43    2.00    3.14    2.71    3.42    8.38       -    7.79       -
11  disposable income               1.62      16    3.14    0.21    1.81    0.60   12.66       -    6.00       -
12  dividing line                   1.30      22    1.28    1.26    2.35    0.20    8.34       -    5.37       -
13  domestic violence              13.24      22   22.70    0.63   23.84    0.20    9.72       -   17.16       -
14  due process                     5.39      26    9.56    1.89    6.14    2.01    5.55       -   10.72       -
15  economic conditions             4.76      50    6.00    1.47    8.49    2.01    5.35       -   10.04       -
16  economic growth                15.21      60   27.70    2.93   11.92   13.07    7.37       -   18.30       -
17  economic power                  5.43      49    4.43    3.14   10.84    3.02    4.51       -   10.52       -
18  educational institution         3.90      50    4.71    3.14    5.60    1.61    7.15    1.20    5.88    4.13
19  environmental factors           8.26      46   10.14    0.84   12.10    8.45    5.80    2.66    7.86    7.61
20  environmental protection        4.62      30    5.00    1.47    1.99   10.06    7.51       -   10.09       -
21  equal opportunity               8.39      30   14.85    0.84   13.91    0.40    9.01       -   13.65       -
22  ethnic minority                18.62      55   23.70    5.66   38.29    2.01   10.40    0.46   14.38    0.86
23  federal government             12.25      47   24.13    3.56   14.09    1.81    9.21       -   16.49       -
24  final stage                     5.39      72    5.85    3.56    5.06    6.84    7.11    0.09    7.51    2.32
25  financial resources             4.22      48    7.57    2.31    3.97    1.61    7.30       -    9.63       -
26  financial support               5.97      59    8.71    2.72    7.59    3.42    7.09       -   11.45       -
27  first generation                5.56      51    6.42    5.66    4.34    5.63    5.77       -   10.93       -
28  foreign investment              4.40      23    8.28    1.68    2.71    3.42    9.00       -    9.88       -
29  foreign policy                 14.05      59   12.85   11.11   28.54    2.41    7.10    1.37   10.92    8.47
30  full range                      6.37      68    6.85    3.14   10.30    4.42    6.18       -   11.75       -
31  further information             9.92      52   15.13    6.92    7.77    7.84    4.96       -   14.39       -
32  further research                6.10      67    6.42    9.64    4.52    4.02    4.39       -   11.11       -
33  high profile                    4.67      45    5.00    3.77    7.77    1.61    7.34       -   10.14       -
34  higher education               26.52      72   33.55   18.86   31.07   18.91    8.44       -   24.24       -
35  individual differences          7.05      41    7.28    2.10   15.53    2.01    5.97       -   12.33       -
36  infinite number                 3.90      39    3.43    3.77    0.54    8.45    7.45       -    9.27       -
37  intellectual property           6.87      25   15.13    2.31    0.72    6.44    9.49       -   12.35       -
38  interpersonal skills            5.25      14   14.28    1.47    1.63    0.20    9.99       -   10.81       -
39  key element                     8.30      86   13.28    2.93   11.56    2.82    6.12    0.66    9.47    0.04
40  key factor                      7.58      86   10.56    1.47    9.21    7.44    6.00    0.60    9.04    0.05
41  living conditions               4.49      35    7.00    1.05    6.86    1.61    6.72       -    9.90       -
42  local government               23.78      88   35.40   10.48   39.02    3.22    7.39    0.64   15.79    4.86
43  low income                      7.63      44   11.14    1.47   10.84    5.03    8.13       -   12.99       -
44  medical treatment               6.51      27   16.70    0.42    3.25    1.61    7.55       -   11.98       -
45  mental health                  59.64      50   84.37    2.10  130.24    1.41    9.26       -   36.40       -
46  mental illness                  8.71      38   10.85    3.77   17.70    0.40    9.15       -   13.90       -
47  minimum standard                3.41      28    7.14    0.84    3.25    0.80    6.70    2.39    5.63       -
48  national identity              10.99      49    5.14    6.71   27.28    5.23    6.78       -   15.51       -
49  native speaker                  5.56      46    3.00   18.65    2.17    0.40    9.97    0.94    7.74    2.01
50  natural resources              11.62      63   17.56    3.77    6.32   16.69    7.33    0.12   10.83    4.62
51  natural world                   6.51      46    1.86   11.32   11.92    2.41    5.29       -   11.73       -
52  next generation                 5.03      55    6.28    3.35    3.97    6.03    7.74       -   10.53       -
53  nuclear power                   6.01      29    1.71    4.61    4.34   15.29    7.47       -   11.51       -
54  nuclear weapon                  4.13      30    4.85    1.68    6.50    2.82   11.84       -    9.59       -
55  ongoing debate                  1.80      21    1.71    1.05    3.97    0.20    8.94       -    6.31       -
56  online database                 1.21       6    0.29    0.21    4.15    0.20    8.28       -    5.18       -
57  paid employment                 4.53      22    2.43    1.47   13.01    1.01    8.70       -   10.03       -
58  physical activity               5.92      33    9.28    2.31    9.57    0.60    5.04    1.51    7.34    4.31
59  physical properties            11.94      38    2.14    4.19    0.90   45.45    8.43       -   16.26       -
60  political economy              18.58      47   22.27    7.34   33.96    7.04    7.85       -   20.26       -
61  political institution           6.33      35    2.28    4.40   18.42    0.40    6.06       -   11.70       -
62  political party                 9.83      50    5.00    3.77   28.00    2.21    7.19       -   14.70       -
63  popular culture                52.59      51   13.42   21.17  175.22    1.41    9.09       -   34.17       -
64  positive feedback               4.58      27    1.57    0.84    1.26   16.09    7.85       -   10.06       -
65  prior knowledge                 3.32      33    3.57    2.72    1.99    5.03    6.59       -    8.51       -
66  private sector                 26.93      83   50.11    3.35   36.13    6.64    9.47    0.68   15.90    9.63
67  public sector                  12.97      46   26.70    1.47   14.99    2.41    7.73       -   16.92       -


Table 4 (continued)

                                  Normed
                                freq per   No of  Normed  Normed  Normed  Normed
No. Collocation                  million   texts      AS      HM      SS      NS      MI StDev a t-score   StDev
68  public transport                8.26      33   17.13    1.68    7.59    2.82    6.47       -   13.41       -
69  qualitative analysis            4.35      45    3.00    6.08    4.15    4.83    7.42    0.06    6.26    4.18
70  racial discrimination          12.21      26   24.13    0.84   17.70    0.20    9.98       -   16.48       -
71  random sample                   2.33      22    3.71    1.47    3.25    0.20    8.25       -    7.19       -
72  raw data                        6.37      46    4.57    9.43   11.20    0.60    9.26       -   11.90       -
73  religious belief                9.24      71    6.00   15.51   15.72    0.60    8.56    0.99    9.93    2.78
74  renewable energy                2.78      14    3.57    0.21    0.36    6.84   10.30       -    7.87       -
75  ruling class                    6.24      25    3.28    5.03   16.08    0.60    9.27       -   11.77       -
76  scientific research             5.79      54    5.85    4.61    7.23    5.23    5.75       -   11.15       -
77  significant impact              4.67      57    8.71    1.89    4.70    1.61    6.24       -   10.06       -
78  small proportion                3.63      53    3.71    1.68    3.61    5.43    7.29       -    8.94       -
79  social factors                 12.12      57   12.99   16.56   15.17    3.22    5.01       -   15.92       -
80  social mobility                 4.13      28    0.71    2.10   13.19    0.80    6.59       -    9.49       -
81  social status                   8.39      58    3.14   17.82   13.91    0.60    4.82       -   13.19       -
82  solar energy                    3.19      16    1.14    1.05    0.72   10.86    7.97       -    8.39       -
83  stark contrast                  1.26      25    1.71    1.26    1.63    0.20   10.17       -    5.29       -
84  varying degree                  6.51      85    6.57    5.87    8.85    4.42    8.68    4.76    7.00    6.82
85  whole range                     8.08      64    8.85    6.29   12.46    3.82    6.52       -   13.27       -
86  wide range                     48.29     258   49.39   32.91   54.37   54.71    8.78    1.56   21.02   13.74
Noun + noun
87  background knowledge            1.93      24    1.28    5.24    0.36    1.41    5.87       -    6.45       -
88  class consciousness             3.46      19    0.29    3.98    9.94       -    6.84       -    8.70       -
89  conflict resolution             2.15      20    2.86    0.21    3.79    1.21    7.65       -    6.89       -
90  data set                        8.48      49    5.00    5.03    8.31   16.89    4.75       -   13.24       -
91  source material                 3.63      35    0.71   11.32    2.17    2.01    5.38       -    8.78       -

a The value of standard deviation is added where different forms of collocations are combined, e.g. educational institution/institutions. It reflects the difference between the two MI scores and t-scores respectively for combined entries.


such/specific issues. It would have been beyond the scope of the validation to attempt to include all possible variations for those collocations in question.

The results show that the overall coverage of the ACL amounts to 1.4% in the source corpus and to 0.1% in the general corpus. This suggests that the coverage of the ACL is 14 times higher in an academic corpus than in a general corpus of similar size. It is interesting to note that Coxhead (2000) found a similar ratio of the AWL coverage in her source corpus (10%) and a corpus of fiction (1.4%). These findings underline the importance that should be placed on teaching and learning these collocations in EAP.
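How a coverage figure of this kind might be computed can be sketched as follows. The definition used here (the share of running tokens that fall inside a matched adjacent collocation) is an assumption, as are the toy token sequences; the validation study may have counted coverage differently.

```python
def coverage(tokens, collocations):
    """Share of tokens that belong to at least one matched adjacent pair."""
    if not tokens:
        return 0.0
    covered = [False] * len(tokens)
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) in collocations:
            covered[i] = covered[i + 1] = True
    return sum(covered) / len(tokens)


# Two ACL entries and two toy 'corpora' (hypothetical sentences).
COLLOCATIONS = {("further", "research"), ("economic", "growth")}
academic_tokens = "further research on economic growth is needed".split()
general_tokens = "she walked home along the river path".split()

academic_coverage = coverage(academic_tokens, COLLOCATIONS)
general_coverage = coverage(general_tokens, COLLOCATIONS)

# Ratio of the two coverage figures reported in the text (1.4% vs 0.1%).
reported_ratio = 1.4 / 0.1
print(academic_coverage, general_coverage, round(reported_ratio))
```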

4. Discussion

As described in the methodology section, the ACL was compiled using quantitative analysis followed by qualitative refinement. During the process of manual vetting and systematization the researchers had to constantly check the selection criteria and adapt them. For example, at the outset the intention was to include two-word collocations only. Yet after

Table 5. Verb + noun and verb + adjective collocations with the highest combined score.

                                  Normed
                                freq per   No of  Normed  Normed  Normed  Normed
No. Collocation                  million   texts      AS      HM      SS      NS      MI   StDev t-score   StDev
Verb + noun
1   achieve (a) goal                6.96      85    9.85    5.66    8.31    2.61    8.98    0.49    6.68    3.21
2   achieve (an) objective          6.19      73   13.56    3.56    3.79    1.01    7.92    0.94    6.64    1.54
3   cast doubt                      1.75      32    0.71    3.35    2.35    1.01   10.07       -    6.24       -
4   make (a) living                 5.07      40    1.71    1.89    2.71   15.49    4.44    1.33    5.44    2.88
5   make (a) prediction             3.99      61    2.71    2.31    5.42    5.83    5.35    1.61   18.31    2.17
6   meet criteria                   3.28      53    5.00    2.72    2.35    2.41    6.95    1.12    4.01    1.62
7   meet (a) requirement            6.60      93   12.13    3.98    3.61    4.63    7.30    1.47    4.32    2.59
8   obtain (a) result               9.51     115   10.14    6.92    3.25   18.10    5.53    1.21    6.41    3.72
9   pose (a) question               7.22     121    5.85    7.76   10.84    4.63    7.65    0.38    5.03    1.25
10  provide (a) clue                1.80      28    1.71    2.31    1.81    1.41    8.02    0.43    4.18    2.19
11  take precedence                 3.19      58    3.71    2.72    3.61    2.41    9.54    0.43    4.72    1.42
12  take responsibility             9.24     105   12.99    7.13   13.37    1.41    5.81    0.66    5.63    3.24
Verb + adjective
13  make explicit                  11.71     174    7.85   21.17   14.99    4.42    6.67    0.30    7.80    2.03


Table 6. Verb + adverb collocations with the highest combined score.

                                  Normed
                                freq per   No of  Normed  Normed  Normed  Normed
No. Collocation                  million   texts      AS      HM      SS      NS      MI   StDev t-score   StDev
Verb + adverb
1   differ significantly            3.68      58    3.85    2.72    2.53    5.63    8.63    0.69    4.80    2.49
2   expand rapidly                  2.24      40    2.57    1.47    2.35    2.41    8.32    1.41    3.78    1.84
3   explore further                 1.88      30    1.71    2.31    1.99    1.61    5.59    1.14    4.28    1.97
4   increase dramatically           2.92      49    3.00    1.89    2.35    4.42    7.63    0.72    3.89    1.11
5   vary widely                     3.10      56    2.00    1.68    3.97    5.03    7.63    1.05    4.53    1.84
Adverb + verb
6   adversely affect                4.35      58    7.00    0.63    3.97    4.63   11.86    0.42    6.95    0.66
7   closely resemble                1.53      26    1.43    1.05    1.26    2.41   10.55       -    5.83       -
8   fully understand                9.38     127   11.99    8.17    8.49    7.84    7.18    0.88    8.24    1.09
9   generally agree                 2.20      41    1.14    2.52    3.43    2.01    6.00    1.51    4.59    2.39
Adverb + verb (past participle)
10  (be) closely allied             1.39      15    2.43    1.68    0.54    0.60   10.17       -    5.56       -
11  (be) deeply rooted              1.66      30    0.86    2.93    2.53    0.60   11.32       -    6.08       -
12  (be) generally accepted         5.30      78    5.71    6.08    4.88    4.42    8.14       -   10.82       -
13  (be) generally considered       3.77      43    4.71    2.52    3.43    4.02    6.03       -    9.02       -
14  (be) inextricably linked        2.47      38    2.71    1.47    3.07    2.41   12.05       -    7.41       -
15  (be) well established          14.32     131   12.13   16.98   14.09   15.08    6.43       -   17.65       -
16  (be) widely accepted            5.25      77    3.43    6.71    7.41    4.02    9.34       -   10.80       -
17  (be) widely used               18.49     104   18.27   10.48   11.20   34.59    7.76       -   20.20       -


reviewing the data from the computational analysis, it was decided to keep those with extended combinations such as deal xx issue, lead xx conclusion, largely based, as they are of pedagogical value. Consulting concordance lines from the source corpus allowed the researchers to identify the most frequent pattern, which was chosen for the final format. The above examples were then completed as deal with an issue, lead to the conclusion, and (be) largely based (on). Very often the addition of words included: articles added to verb + noun combinations, e.g. raise (an) issue; copula be added to adverb + vpp combinations if they can be used as predicate, e.g. (be) widely recognized; prepositions added to verb or noun combinations, e.g. rely heavily (on) and (a) high proportion (of); or a combination of two of the above conditions, e.g. (be) greatly influenced (by). Providing the most frequent pattern increases the list’s usefulness and applicability for teachers and learners alike.

One major challenge of compiling this list lay in the fact that there seems to be no absolute definition of collocation, and this is particularly true for some word combinations. The moderate inter-rater reliability among the expert panel reported in the methodology section suggests that which entries qualify as academic collocations for pedagogical purposes may be interpreted rather differently. As discussed in the introduction, collocations represent a continuum of formulaicity and semantic opaqueness, and at the earlier stage it was decided to remove combinations at the two extreme ends, which are either fixed expressions (e.g. global warming) or semantically transparent combinations (e.g. first issue). Yet there is no absolute dividing line to determine what ‘fixedness’ or ‘transparency’ is. Take the entry academic life for example. At the stage of expert judgement, it received a wide range of scores from 1 (exclude) to 4 (include), which indicates the disagreement among experts. Whether it is semantically transparent may be debatable, yet it was accepted as an entry in the list because, in addition to meeting all quantitative criteria, it still received a moderate total score of 15 from the experts. Comparing the appropriateness of the examples that received a total score below 10 (Table 8) with the ones with the highest expert scores (Tables 4–7), it becomes clear why we believe that adopting expert judgement in selecting the entries and utilizing the feedback from the panel, such as systematizing the entries, have contributed significantly to the quality of this listing, which could not have been achieved by relying solely on computational analysis.

Table 7. Adverb + adjective collocations with the highest combined score.

                                  Normed
                                freq per   No of  Normed  Normed  Normed  Normed
No. Adverb + adjective           million   texts      AS      HM      SS      NS      MI t-score
1   ever increasing                 3.37      52    3.14    3.56    4.52    2.21    7.71    8.62
2   hardly surprising               3.05      43    2.43    3.98    3.61    2.41   10.79    8.24
3   increasingly important          5.07      60    6.71    1.68    4.88    6.23    5.86   10.45
4   mutually exclusive              5.56      70    4.00    8.80    7.23    2.82   13.24   11.13
6   radically different             4.67      71    1.71    9.01    6.14    3.02    7.76   10.15
7   relatively few                  6.15      76    5.71    2.72    9.75    6.03    7.10   11.62
8   relatively high                 6.33      65    5.85    4.40    6.32    8.85    6.12   11.70
9   relatively little               7.18      88    4.14    6.71   13.19    5.23    7.16   12.56
10  relatively simple               4.53      60    4.28    5.87    2.17    6.23    7.02    9.97
11  slightly different              9.60     118    9.42    8.80    7.95   12.47    7.35   14.54


Table 8. Examples rejected due to low combined score.

                                  Normed
                                freq per   No of  Normed  Normed  Normed  Normed
No. Component I + Component II   million   texts      AS      HM      SS      NS      MI t-score
1   use detection                   1.17       7    2.43    0.21    0.54    1.01    5.18    4.96
2   access control                  1.62      13    3.14    0.21    0.72    1.81    3.99    5.62
3   imagined community              1.12      14    0.29    1.05    2.35    1.01    7.02    4.96
4   individual criminal             1.08       5    2.71    0.21    0.54    0.20    4.59    4.69
5   biological basis                1.26      17    0.43    0.21    3.97    0.40    6.18    5.22


Nonetheless, three issues which required further manual scrutiny became apparent after the expert review. First of all, as the words processed by the computer were not lemmatized, nouns with singular and plural forms or inflected verbs were processed separately despite having identical collocates, and hence some entries were duplicated. To rectify this issue, the original computational output was checked again manually, and duplicates such as fundamental assumption/fundamental assumptions were combined. For noun combinations where both singular and plural forms were present, only the more frequent form was retained, e.g. the singular form fundamental assumption, but the plural form common characteristics. For verb combinations, inflected forms were converted to the infinitive, e.g. deny access instead of denied access or denying access. Combinations with verb past participles (vpp) proved more complicated, as they often occur in a variety of contexts: modifying nouns as adjectives (see Example 1) or collocating with copula be or linking verbs as complement (see Examples 2–3). It is also possible that a main verb shares the same form with its verb past participle, in which case the current automated corpus-driven approach is unable to distinguish them (compare Examples 3 and 4). The variable position of the adverb collocate adds even more complexity to this matter (see Example 5). To solve this issue, the following policy was adopted: if the vpp forms generally function as complement in the concordance lines, as in Examples 2–3, then an optional copula be was added. In addition, only the most frequent pattern was kept; therefore, directly involved is included as opposed to involved directly, which occurs less frequently.

Examples (as retrieved from the corpus)

(1) We use a well established procedure to identify the regions.
(2) Firstly, it is now well established that investigations into human rights abuses ...
(3) ... that the two leaders will have to become directly involved in talks rather than acting through negotiators.
(4) Only those negative outcomes that directly involved the pupil were included.
(5) When the nurse has not been involved directly in the administration of the medication.
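The policy of retaining only the more frequent surface form can be sketched as follows. The grouping keys are assigned by hand here, since the corpus output was not lemmatized, and all counts are hypothetical; only the retained forms match those reported above.

```python
# (grouping key, surface form, hypothetical corpus count)
OBSERVATIONS = [
    ("fundamental assumption", "fundamental assumption", 40),
    ("fundamental assumption", "fundamental assumptions", 25),
    ("common characteristic", "common characteristic", 10),
    ("common characteristic", "common characteristics", 30),
    ("directly involve", "directly involved", 50),
    ("directly involve", "involved directly", 12),
]


def most_frequent_forms(observations):
    """For each grouping key, keep the surface form with the highest count."""
    best = {}
    for key, form, count in observations:
        if key not in best or count > best[key][1]:
            best[key] = (form, count)
    return {key: form for key, (form, _count) in best.items()}


retained = most_frequent_forms(OBSERVATIONS)
for key, form in retained.items():
    print(f"{key}: {form}")
```

Verb + noun combinations are not covered by this sketch, because for those the inflected forms were converted to the infinitive rather than reduced to the most frequent variant.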

The second issue was that some combinations were excluded because the individual inflected forms with their collocates failed to meet the frequency threshold, yet their combined frequency is actually higher than the cut-off frequency. The expert panel suggested a number of such collocations that were not covered in the reviewed list. The researchers then scrutinized the original data-driven list of 130,000+ entries again in search of any collocations that had been missed as a result of inflections. Another 9.7% (n = 239) of additional combinations (e.g. differ considerably or exercise authority) were added to the list at this stage. The above tasks were very time-consuming, yet such manual scrutiny was required in order to further improve the quality of the listing.
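The effect described here can be illustrated with a minimal sketch: entries whose inflected forms each fall below a cut-off but whose summed frequency reaches it. The cut-off value and all counts are hypothetical, and the check is expressed in raw counts purely for simplicity.

```python
CUTOFF = 25  # hypothetical minimum count; not the threshold used in the study

# Hypothetical counts of inflected variants, keyed by the base-form pair.
FORM_COUNTS = {
    ("differ", "considerably"): {
        "differ considerably": 12,
        "differs considerably": 9,
        "differed considerably": 8,
    },
    ("exercise", "authority"): {
        "exercise authority": 10,
        "exercised authority": 8,
        "exercising authority": 9,
    },
    ("deny", "access"): {"deny access": 30, "denied access": 14},
    ("use", "detection"): {"use detection": 5, "used detection": 4},
}


def recover_missed(form_counts, cutoff):
    """Entries with no single form at the cut-off but a combined count that reaches it."""
    missed = []
    for pair, counts in form_counts.items():
        no_form_passes = all(count < cutoff for count in counts.values())
        if no_form_passes and sum(counts.values()) >= cutoff:
            missed.append(pair)
    return missed


recovered = recover_missed(FORM_COUNTS, CUTOFF)
print(recovered)
```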

The final issue involves whether highly sensitive entries should be included in the ACL. The principal researchers decided to exclude entries such as mental retardation as some experts advised that these entries could be potentially offensive.

One interesting finding is the dominance of nominal combinations in the list. This might indicate that nouns have a greater tendency to collocate than other word classes, which would require further analysis. It might also be reflective of the written register, particularly in scientific writing, which is characterized by a large proportion of nominalization in academic texts, indicating high information content. For example, Biber and Gray (2010: 2) found that academic writing is ‘structurally “compressed”, with phrasal (non-clausal) modifiers embedded in noun phrases’. Fang, Schleppegrell, and Cox (2006) reported that the extensive use of nouns and nominal expressions in academic text can pose great challenges for comprehension (for the discussion of nominalization in the written register, see also Halliday, 1985; Quirk, Greenbaum, Leech, & Svartvik, 1985). Yet very little research has been conducted on adjectival or nominal collocations in academic English. One of the few exceptions is a study by Lorenz (1998) looking at intensifying adjectives such as important or different. Lorenz’s study, however, focuses on the contrastive use of adjective intensification between native and non-native writing rather than the use of academic collocations per se.

One might object that the listing has undergone various stages of filtering and that many potentially useful and valuable combinations might have been removed from the final ACL during this process. However, we argue that providing too much information, in this case too many collocational entries, can be overwhelming for learners. Research has shown that even with collocation dictionaries at hand, students still struggle to find correct verb collocates in a given task (Dziemianko, 2010; Laufer, 2011). We believe that the ACL, composed of carefully selected collocations, can serve as part of a lexical syllabus to raise learners’ awareness of word co-occurrence and help them prioritize the learning of lexical items.


5. Conclusion

The Academic Collocation List comprises 2,468 entries. Our definition of collocations refers to word combinations which co-occur more frequently than by chance across academic disciplines (hence corpus-driven) and are pedagogically relevant in an EAP context (hence expert-judged). Within the scope of this definition, we primarily focused on lexical collocations as they contain certain variability and are thus more dynamic, while grammatical collocations or idioms consist of comparatively fixed patterns and are consequently more predictable. The former are more challenging for learners to master, whereas the latter can generally be treated as holistic units and learners can more easily internalize the usage into their lexicon.

The ACL was compiled using a mixed-method approach of combining computational analysis of the source corpus with expert judgement and systematization. As pointed out above, existing corpus-driven multi-word lists often fail to provide immediately usable resources for language learning, and it is only with expert intervention that raw data can be filtered and refined in order to extract the most informative and meaningful entries. In the case of the ACL, the statistical information served as an important reference, but in addition each collocation included in the final list had been subjected to expert judgement and manual refinement. The advantage of relying not only on computational analysis but also on human intervention is that the data-driven approach ensures that important combinations are not missed while expert judgement ensures that the final entries are appropriate and relevant for EAP.

This approach led to an Academic Collocation List that will be of much greater use to language learners and EAP teachers alike. In addition to the Academic Word List and the Academic Formulas List, the ACL provides a further tool for EAP teachers to construct appropriate teaching materials and help students focus on frequent lexical items beyond individual words.

As research has shown, collocations are difficult to learn and retain even with the assistance of dictionaries. The list can therefore be used to support learning by drawing attention to collocation per se. By subdividing the ACL, students and teachers can focus exclusively, for example, on adjective + noun combinations, on common frames or collocation families. Explicit teaching needs to be complemented by providing students with the opportunity to encounter collocations when dealing with academic texts. Combining both explicit and implicit teaching will enhance learners’ receptive and productive collocational knowledge and thus improve their academic English proficiency. EAP material writers may also benefit from integrating the ACL into the syllabus.

In terms of directions for future research, as Hyland and Tse (2007) suggested, further research into an extended ACL may be required to highlight specific collocations that are exclusively frequent in individual fields of study. We also propose that future research should aim at categorizing the entries of the ACL on the basis, for example, of semantics or discourse functions. For instance, the hedging function of some of the entries, e.g. virtually impossible, relatively stable, is one feature that came to light during the compilation process. Types of errors or underuse/overuse in learner language may be identified to help students improve their academic proficiency as well.

Acknowledgements

We thank Douglas Biber, Regents’ Professor in the Applied Linguistics Program at Northern Arizona University, and Bethany Gray, Assistant Professor in the Department of English and the TESL/Applied Linguistics Program at Iowa State University, for conducting the computational analysis of the source corpus.

We would like to express our gratitude to Andrew Roberts, Computational Linguist, for tagging the initial collocation list and conducting the validation study of the Academic Collocation List.

We are also grateful to the members of the expert panel: David Crystal, Honorary Professor of Linguistics, University of Bangor; Geoffrey Leech, Emeritus Professor of English Linguistics, Lancaster University; Diane Schmitt, Senior Lecturer in EFL/TESOL, Nottingham Trent University; Della Summers, Dictionary Consultant; Professor Lord Randolph Quirk, FBA; and Chris Fox, Managing Editor, Pearson.

We would also like to thank Mike Mayor, Editorial Director, Dictionaries & Reference, Pearson for contributing valuable advice and John H.A.L. De Jong, Senior Vice President, Standards and Quality Office, Pearson for his support throughout the project.

Last but not least, we thank the anonymous reviewers for their helpful and insightful comments and suggestions.

The complete Academic Collocation List can be accessed via the following link: http://www.pearsonpte.com/research/Pages/CollocationList.aspx

Appendix A. Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.jeap.2013.08.002.

References

Altenberg, B., & Granger, S. (2001). The grammatical and lexical patterning of MAKE in native and non-native student writing. Applied Linguistics, 22(2), 173–195.

Bahns, J., & Eldaw, M. (1993). Should we teach EFL students collocations? System, 21(1), 101–114.

Benson, M. (1985). Collocation and idioms. In R. Ilson (Ed.), Dictionaries, lexicography and language learning (pp. 61–68). Oxford: Pergamon Press.

Biber, D., & Conrad, S. (1999). Lexical bundles in conversation and academic prose. In H. Hasselgard, & S. Oksefjell (Eds.), Out of corpora: Studies in honour of Stig Johansson (pp. 181–190). Amsterdam: Rodopi.

Biber, D., Conrad, S., & Cortes, V. (2004). If you look at…: lexical bundles in university teaching and textbooks. Applied Linguistics, 25(3), 371–405.

Biber, D., & Gray, B. (2010). Challenging stereotypes about academic writing: complexity, elaboration, explicitness. Journal of English for Academic Purposes, 9(1), 2–20.

Biskup, D. (1992). L1 influence on learners’ renderings of English collocations. A Polish/German empirical study. In P. J. L. Arnaud, & H. Béjoint (Eds.), Vocabulary and applied linguistics (pp. 85–93). London: Macmillan.

Chen, Y. H., & Baker, P. (2010). Lexical bundles in native and non-native academic writing. Language Learning and Technology, 14(2), 30–49.

Cobb, T. (2003). Analyzing late interlanguage with learner corpora: Quebec replications of three European studies. Canadian Modern Language Review, 59(3), 393–423.

Cowie, A. P. (1981). The treatment of collocations and idioms in learners’ dictionaries. Applied Linguistics, 2(3), 223–235.

Cowie, A. P. (1994). Phraseology. In R. E. Asher (Ed.), The encyclopedia of language and linguistics (Vol. 6; pp. 3168–3171). Oxford: Pergamon.

Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238.

Durrant, P. (2009). Investigating the viability of a collocation list for students of English for academic purposes. English for Specific Purposes, 28(3), 157–169.

Dziemianko, A. (2010). Paper or electronic? The role of dictionary form in language perception, production, and the retention of meanings and collocation. International Journal of Lexicography, 23(3), 257–273.

Fang, Z., Schleppegrell, M., & Cox, B. (2006). Understanding the language demands of schooling: nouns in academic registers. Journal of Literacy Research, 38(3), 247–273.

Granger, S. (1998). Prefabricated patterns in advanced EFL writing: collocations and formulae. In A. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 145–160). Oxford: Oxford University Press.

Granger, S., & Meunier, F. (Eds.). (2008). Phraseology: An interdisciplinary perspective. Amsterdam: John Benjamins.

Granger, S., & Paquot, M. (2008). Disentangling the phraseological web. In S. Granger, & F. Meunier (Eds.), Phraseology: An interdisciplinary perspective. Amsterdam & Philadelphia: John Benjamins.

Halliday, M. A. K. (1985). An introduction to functional grammar. London: Continuum.

Hoey, M. (2005). Lexical priming: A new theory of words and language. London: Routledge.

Howarth, P. (1998). Phraseology and second language proficiency. Applied Linguistics, 19(1), 24–44.

Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.

Hyland, K. (2008). Academic clusters: text patterning in published and postgraduate writing. International Journal of Applied Linguistics, 18(1), 41–62.

Hyland, K., & Tse, P. (2007). Is there an “Academic vocabulary”? TESOL Quarterly, 41(2), 235–253.

Laufer, B. (2011). The contribution of dictionary use to the production and retention of collocations in a second language. International Journal of Lexicography, 24(1), 29–49.

Laufer, B., & Waldman, T. (2011). Verb-Noun collocations in second language writing: a corpus analysis of learners’ English. Language Learning, 61(2), 647–672.

Lorenz, G. (1998). Overstatement in advanced learners’ writing: stylistic aspects of adjective intensification. In S. Granger (Ed.), Learner English on computer (pp. 53–66). London and New York: Addison Wesley Longman Limited.

Martinez, R., & Schmitt, N. (2012). A phrasal expressions list. Applied Linguistics, 33(3), 299–320.

Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.

Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics, 24(2), 223–242.

Nesselhauf, N. (2005). Collocations in a learner corpus. Amsterdam: John Benjamins.

Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A comprehensive grammar of the English language. London: Longman.

Schmitt, N. (Ed.). (2004). Formulaic sequences: Acquisition, processing, and use. Amsterdam: John Benjamins.

Simpson-Vlach, R., & Ellis, N. C. (2010). An academic formulas list: new methods in phraseology research. Applied Linguistics, 31(4), 487–512.

Stubbs, M. (2001). Words and phrases: Corpus studies of lexical semantics. Oxford: Blackwell.

Tognini-Bonelli, E. (2001). Corpus linguistics at work: Studies in corpus linguistics (Vol. 6). Amsterdam: John Benjamins.

West, M. (1953). A general service list of English words. London: Longman.

Wray, A. (2008). Formulaic language: Pushing the boundaries. Oxford: Oxford University Press.

Kirsten Ackermann worked as Research Manager for Pearson, during which time she conducted research on Pearson’s testing programmes focusing on corpus-based research, rater standardization and CEFR alignment. Kirsten holds a Master’s degree in English, Political Science and Education from the Free University and one in British Studies from the Humboldt University Berlin, Germany.

Yu-Hua Chen holds a PhD in Linguistics from Lancaster University, UK. Her doctoral dissertation ‘Investigating Lexical Bundles across Learner Writing Development’ was selected as a finalist for the Jacqueline Ross TOEFL Dissertation Award. She is interested in how corpus analysis can facilitate or validate approaches to teaching and assessing language skills.