
What Can Readability Measures Really Tell Us About Text Complexity?

Sanja Štajner, Richard Evans, Constantin Orăsan, and Ruslan Mitkov

Research Institute in Information and Language Processing, University of Wolverhampton

[email protected], [email protected], [email protected], [email protected]

Abstract
This study presents the results of an initial phase of a project seeking to convert texts into a more accessible form for people with autism spectrum disorders by means of text simplification technologies. Random samples of Simple Wikipedia articles are compared with texts from News, Health, and Fiction genres using four standard readability indices (Kincaid, Flesch, Fog and SMOG) and sixteen linguistically motivated features. The comparison of readability indices across the four genres indicated that the Fiction genre was relatively easy whereas the News genre was relatively difficult to read. The correlation of the four readability indices was measured, revealing that they are almost perfectly linearly correlated and that this correlation is not genre dependent. The correlation of the sixteen linguistic features with the readability indices was also measured. The results of these experiments indicate that some of the linguistic features are well correlated with the readability measures and that these correlations are genre dependent. The maximum correlation was observed for Fiction.

Keywords: text simplification, readability, autism spectrum disorders

1. Introduction

Text simplification can be regarded as the process of converting input text into a more accessible form. The conversion process may be facilitated by research in various areas of NLP, including lexical simplification (Yatskar et al., 2010), anaphora resolution (Mitkov, 2002), word sense disambiguation (Escudero et al., 2000), syntactic simplification (Siddharthan, 2006; Evans, 2011), text summarisation (Orasan and Hasler, 2007), and image retrieval (Bosma, 2005).

In the context of personalisable applications, it is necessary for systems not only to simplify text, but also to discriminate between material that should be simplified and material that should not be, for the benefit of a particular user. This discrimination can be realised by quantifying the difficulty of the material by means of various features of the text, and comparing those feature values with thresholds specified in user preferences.

The work described in this paper is part of an ongoing project that develops tools to help readers with autism spectrum disorders (ASD). One of the prerequisites for this research is to have a way to assess the difficulty of texts. A set of metrics is proposed with the aim of quantifying the difficulty of input documents with respect to the requirements of these readers. This set contains readability indices and metrics inspired by the needs of people with ASD. Documents from several genres are evaluated with regard to these metrics and the correlation between them is reported.

1.1. Requirements of Users with Autism Spectrum Disorders

This paper presents research undertaken in the initial phase of FIRST,1 a project to develop language technology (LT) that will convert documents from various genres in

1 A Flexible Interactive Reading Support Tool (http://www.first-asd.eu).

Bulgarian, English, and Spanish into a more accessible form for readers with autism spectrum disorders (ASD). ASD are defined as neurodevelopmental disorders characterised by qualitative impairment in communication and stereotyped repetitive behaviour. They are serious disabilities affecting approximately 60 people out of every 10 000 in the EU. People with ASD usually have language deficits with a life-long impact on their psychosocial functioning. These deficits are in the comprehension of speech and writing, including misinterpretation of figurative language and difficulty understanding complex instructions (Minshew and Goldstein, 1998). In many cases, people with ASD are unable to derive the gist of written documents (Nation et al., 2006; O'Connor and Klein, 2004; Frith and Snowling, 1983).

Written documents pose various obstacles to reading comprehension for readers with ASD. These include:

1. Ambiguity in meaning:

(a) Figurative language such as metaphor and idioms,

(b) Non-literal language such as sarcasm,

(c) Semantically ambiguous words and phrases,

(d) Highly specialised/technical words and phrases.

2. Structural complexity:

(a) Morphologically, orthographically, and phonetically complex words,

(b) Syntactically complex sentences,

(c) Inconsistent document formatting.

A detailed study of user requirements derived from a focus group partially supported the initial hypothesis of their reading comprehension difficulties. The focus group made recommendations for the automatic simplification


of phenomena at various linguistic levels. This includes the automatic expansion and elaboration of acronyms and abbreviations (obstacle 1d); the replacement of ambiguous words/homographs by less ambiguous words (obstacle 1c); the substitution of anaphoric references by their antecedents, especially in the case of zero anaphora (obstacle 1c); the rewriting of long sentences as sequences of short sentences and the conversion of passive sentences into active sentences (obstacle 2b); and the translation of phraseological units such as collocations, idioms, and ironic/sarcastic statements into a more literal form (obstacles 1a and 1b).

In addition to the removal of obstacles to reading comprehension, recommendations were also made for the addition of indicative summaries, multimedia, and visual aids to the converted documents output by FIRST.

1.2. Readability Indices

Independent of the specific requirements of readers with ASD, readability indices are one means by which the reading difficulty of a document can be estimated. DuBay (2004) notes that over 200 readability formulae have been developed so far, with over 1 000 studies of their application published. In the research described in the present paper, the Flesch Reading Ease score (Flesch, 1949), the Kincaid readability formula (Kincaid et al., 1986), the Fog Index (Gunning, 1952), and the SMOG grading (McLaughlin, 1969) were selected for this purpose. Considering each in turn:

The Flesch Reading Ease score is obtained by the formula:

Score = 206.835 − (1.015 × ASL) − (84.6 × ASW)

Here, ASL denotes the average sentence length and ASW the average number of syllables per word. The Flesch Reading Ease formula returns a number from 1 to 100, rather than a grade level. Documents with a Flesch Reading Ease score of 30 are considered "very difficult" while those with a score of 70 are considered "easy" to read. The software developed in FIRST is therefore required to convert documents into a form with a Reading Ease score higher than 90, commensurate with fifth grade reading level.

The Flesch-Kincaid readability formula2 is a simplified version of the Flesch Reading Ease score. It is based on identification of the average sentence length of the document to be assessed (ASL) and the average number of syllables per word in the document (ASW). The formula estimates readability by US grade level (GL):

GL = (0.4 × ASL) + (12 × ASW) − 15

The Fog Index (Gunning, 1952) exploits two variables: average sentence length and the number of words containing more than two syllables ("hard words") for each 100 words of a document. This index returns the US

2 To avoid confusion, in the current paper, the Flesch-Kincaid readability formula will hereafter be referred to as the Kincaid readability formula.

Grade Level (GL) of the input document, according to the formula:

GL = 0.4 × (average sentence length + hard words)

The SMOG grading (McLaughlin, 1969) is computed by considering the polysyllable count, equivalent to the number of words that contain more than two syllables in 30 sentences, and applying the following formula:

SMOG grading = 3 + √(polysyllable count)

It has been noted that the SMOG formula is quite widely used, particularly in the preparation of US healthcare documents intended for the general public.3

The selection of these standard readability metrics was made due to the observation that, although based on different types of information, they all demonstrate significant correlation in their prediction of the relative difficulty of the collections of documents assessed in the research described in this paper.

The standard readability metrics were computed using the GNU style package, which exploits an automatic method for syllable identification. Manual studies of the efficacy of this module suggest that it performs with an accuracy of roughly 90%, similar to state-of-the-art part-of-speech taggers.
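Taken together, the four formulae above reduce to a few arithmetic operations over precomputed sentence- and word-level counts. The following sketch implements them directly from the definitions in this section; the input variables (average sentence length, average syllables per word, hard words per 100 words, polysyllable count in 30 sentences) are assumed to have been extracted already, e.g. by a tool such as GNU style:

```python
import math

def flesch_reading_ease(asl, asw):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - (1.015 * asl) - (84.6 * asw)

def kincaid_grade(asl, asw):
    """Kincaid formula: estimates a US grade level from the same two variables."""
    return (0.4 * asl) + (12 * asw) - 15

def fog_index(asl, hard_words_per_100):
    """Fog Index: ASL plus 'hard words' (more than two syllables) per 100 words."""
    return 0.4 * (asl + hard_words_per_100)

def smog_grading(polysyllable_count):
    """SMOG grading from the polysyllable count in 30 sentences."""
    return 3 + math.sqrt(polysyllable_count)

# Using the SimpleWiki corpus averages from Table 4 (w/s = 16.05, syl/w = 1.43):
print(f"{flesch_reading_ease(16.05, 1.43):.1f}")  # 69.6
```

Note that the printed value differs slightly from the 69.91 reported in Table 4, since the table averages per-document scores rather than scoring the corpus-level means.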

2. Related Work

Previous research has shown that the average US citizen reads at the seventh grade level (NCES, 1993). Experts in health literacy have recommended that materials to be read by the general population should be written at fifth or sixth grade level (Doak et al., 1996; Weiss and Coyne, 1997). The FIRST project aims to produce documents suitable for users with reading comprehension problems. Due to the reading difficulties of people with ASD, documents output by the software developed in the project should not exceed the fifth grade level (suitable for people with no reading comprehension difficulties at ten or eleven years old). Together, these constraints emphasise the desirability of consistent and reliable methods to quantify the readability of documents.

In Flesch (1949), it was found that documents presenting fictional stories lay in the range 70 ≤ Score ≤ 90. Only comics were assigned a higher score for reading ease than this. The most difficult type of document was that of scientific literature, with 0 ≤ Score ≤ 30. During the 1940s, the Reading Ease scores of news articles were at the sixteenth grade level. It is estimated that in contemporary times, this has been reduced to the eleventh grade level.

The set of linguistic features employed in the research described in this paper (Section 3.2.) shares some similarity with the variables shown by Gray and Leary (1935) to be

3 For example, the Harvard School of Public Health provides guidance to its staff on the preparation of documents for access by senior citizens that is based on the SMOG formula (http://www.hsph.harvard.edu/healthliteracy/files/howtosmog.pdf, last accessed 1st March 2012).


closely correlated with reading difficulty. These variables include the number of first, second, and third person pronouns (correlation of 0.48), the number of simple sentences within the document (0.39), and the number of prepositional phrases occurring in the document (0.35). There is also some similarity with features exploited by Coleman (1965) in several readability formulae. These features include counts of the numbers of pronouns and prepositions occurring in each 100 words of an input document.

DuBay (2004) presents the arguments of several researchers who criticise the standard readability indices on numerous grounds. For example, the metrics have been noted to disagree in their assessment of documents (Kern, 2004). However, DuBay defends their use, arguing that the important issue is the degree of consistency that each formula offers in its predictions of the difficulty of a range of texts and the closeness with which the formulae are correlated with reading comprehension test results.

Research by Coleman (1971) and Bormuth (1966) highlighted a close correlation between standard readability metrics and the variables shown to be indicative of reading difficulty. These findings motivate the current investigation into potential correlations between standard readability metrics and the metrics sensitive to the occurrence of linguistic phenomena.

3. Methodology

This section describes the methodology employed in order to explore potential correlations between the standard readability indices and the linguistic features used to measure the accessibility of different types of document for readers with ASD. It contains a description of the corpora (Section 3.1.), details of the linguistic features of accessibility that are pertinent for these readers (Section 3.2.), and details of the means by which the values of these features were automatically obtained (Section 3.3.).

3.1. Corpora

The LT developed in the FIRST project is intended to convert Bulgarian, English, and Spanish documents from fiction, news, and health genres4 into a form facilitating the reading comprehension of users with ASD. The current paper focuses on the processing of documents written in English.

Collections of documents from these genres were compiled on the recommendation of clinical experts within the project consortium. This recommendation was based on the prediction that access to documents of these types would both motivate research into the removal of a broad spectrum of obstacles to reading comprehension and also serve to improve perceptions of inclusion on the part of readers with ASD. In the current paper, the assessment of readability is made with respect to the following document collections (Table 1):

1. NEWS - a collection comprising reports on court cases in the METER corpus (Gaizauskas et al., 2001)

4 In this paper, we use the term health to denote documents from the genre of education in the domain of health.

and articles from the PRESS category of the FLOB corpus.5 The documents selected from FLOB were each approximately 2 000 words in length. The news articles from the METER corpus were rather short; none of them had more than 1 000 words. We included only documents with at least 500 words;

2. HEALTH - a collection comprising healthcare information contained in a collection of leaflets for distribution to the general public, from categories A01, A0J, B1M, BN7, CJ9, and EDB of the British National Corpus (Burnard, 1995). This sample contains documents with considerable variation in word length;

3. FICTION - a collection of documents from the FICTION category of the FLOB corpus. Each is approximately 2 000 words in size; and

4. SIMPLEWIKI - a random selection of simplified encyclopaedic documents, each consisting of more than 1 000 words, from Simple Wikipedia.6 This collection is included as a potential model of accessibility. One of the goals of the research described in this paper is to compare the readability of the other types of document with this "standard".

Corpus      Words    Texts
SimpleWiki  272,445  170
News        299,685  171
Health      113,269   91
Fiction     243,655  120

Table 1: Size of the corpora

3.2. Linguistic Features of Document Accessibility

The obstacles to reading comprehension faced by people with ASD when seeking to access written information were presented in Section 1.1. The features presented in this section are intended to indicate the occurrence of these obstacles in input documents. Thirteen features are proposed as a means of detecting the occurrence of the different types of obstacle to reading comprehension listed in Section 1.1. Related groups of features are presented below.

(1) Features indicative of structural complexity: This group of ten features was inspired by the syntactic concept of the projection principle (Chomsky, 1986), that "lexical structure must be represented categorically at every syntactic level". This implies that the number of noun phrases in a sentence is proportional to the number of nouns in that sentence, that the number of verbs in a sentence is related to the number of clauses and verb phrases, etc. The values of nine of these features were obtained by processing the output of Machinese Syntax7 to detect the

5 Freiburg-LOB Corpus of British English (http://khnt.hit.uib.no/icame/manuals/flob/INDEX.HTM)

6 http://simple.wikipedia.org
7 http://www.connexor.eu


Feature                           Indicator of
Nouns (N)                         References to concepts/entities
Adjectives (A)                    Descriptive information about concepts/entities
Determiners (Det)                 References to concepts that are not proper names, acronyms, or abbreviations
Adverbs (Adv)                     Descriptive information associated with properties of and relations between concepts/entities
Verbs (V)                         Properties of and relations between concepts/entities
Infinitive markers (INF)          Infinitive verbs (a measure of syntactic complexity)
Coordinating conjunctions (CC)    Coordinated phrases
Subordinating conjunctions (CS)   Subordinated phrases, including phrases embedded at multiple levels
Prepositions (Prep)               Prepositional phrases (a well-cited source of syntactic ambiguity and complexity)

Table 2: Features (structural complexity)

occurrence of words/lemmas with particular part-of-speech tags (Table 2). As the tenth feature we proposed Sentence complexity (Compl) in terms of the number of verb chains. It was measured as the ratio of the number of sentences in the document containing at most one verb chain to the number containing two or more verb chains. To illustrate, the sentence:

I am consumed with curiosity, and I cannot rest until I know why this Major Henlow should have sent the Runners after you.

contains four verb chains: {am consumed}, {cannot rest}, {know}, and {should have sent}. This feature exploits the functional tags assigned to different words by Machinese Syntax (Section 3.3.1.).
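Given per-sentence verb-chain counts from a parser, the Compl ratio defined above is straightforward to compute. This is a minimal sketch that assumes the counts have already been extracted (the parser integration itself is not shown):

```python
def sentence_complexity(verb_chain_counts):
    """Compl: ratio of sentences containing at most one verb chain to
    sentences containing two or more (the per-sentence counts are
    assumed to come from a parser such as Machinese Syntax)."""
    simple = sum(1 for n in verb_chain_counts if n <= 1)
    complex_ = sum(1 for n in verb_chain_counts if n >= 2)
    return simple / complex_ if complex_ else float("inf")

# A four-sentence document in which two sentences have multiple verb chains:
print(sentence_complexity([4, 1, 0, 2]))  # 1.0
```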

(2) Features indicative of ambiguity in meaning: This group of three features (Table 3) is intended to indicate the amount of semantic ambiguity in the input document.

Feature                         Indicator of
Pronouns (Pron)                 Anaphoric references
Definite descriptions (defNP)   Anaphoric references
Word senses (Senses)            Semantic ambiguity

Table 3: Features (ambiguity in meaning)

In all three cases, the difficulties caused by the feature arise as a result of doubts over the reference to concepts in the domain of discourse by different linguistic units (words and phrases). The values of these features are obtained by processing the output of Machinese Syntax to detect both the occurrence of words/lemmas with particular parts of speech and the functional dependencies holding between different words, and by exploitation of WordNet as a source of information about the senses associated with content words in the input text.

These features were calculated as averages per sentence. The only exception was the feature Senses, which was computed as the average number of senses per word.
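The Senses feature can be sketched as follows. The toy sense inventory below is a hypothetical stand-in for WordNet lookups (in the actual pipeline the counts come from WordNet, restricted to the part of speech assigned by the parser), so the numbers are illustrative only:

```python
# Hypothetical (lemma, POS) -> sense-count table standing in for WordNet.
SENSE_INVENTORY = {("bank", "N"): 10, ("run", "V"): 41, ("set", "N"): 13}

def avg_senses_per_word(tagged_tokens):
    """Senses: average number of senses per word; repeated occurrences
    of the same ambiguous word raise the value."""
    counts = [SENSE_INVENTORY.get(tok, 1) for tok in tagged_tokens]
    return sum(counts) / len(counts)

print(avg_senses_per_word([("bank", "N"), ("run", "V")]))  # 25.5
```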

3.3. Extraction of Linguistic Features

A user requirements analysis undertaken during the initial stage of the project motivated the development of features of accessibility based on the occurrence of various linguistic phenomena in an input document. Given that

these are complex and difficult to detect automatically, the linguistic features are based on indicative morpho-syntactic information that can be obtained via existing NLP resources.

Derivation of the feature values depends on exploitation of two language technologies: Connexor's Machinese Syntax functional dependency parser (Tapanainen and Jarvinen, 1997) and the generic ontology WordNet (Fellbaum, 1998). The detection process is based on the assumption that words with particular morphological and surface syntactic tags assigned by Machinese Syntax indicate the occurrence of different types of linguistic phenomenon. One caveat that should be made with regard to the values obtained for these features is that they exploit language processing technology that is imperfect in its accuracy and coverage. The efficacy of Connexor's Machinese Syntax, used to obtain the values for the linguistic features, is described in (Tapanainen and Jarvinen, 1997).

3.3.1. Functional Dependencies

The values of two features, defNP and Compl, are obtained by reference to the functional dependencies detected by Machinese Syntax between words in the input documents. The feature defNP is intended to obtain the number of definite noun phrases occurring in each sentence of an input document. This number is measured by counting the number of times that functional dependencies occur between tokens with the lemma the, this, or that and tokens with a nominal surface syntactic category.

The feature Compl, which relies on identification of the verb chains occurring in each sentence of a document (see Section 3.2.), exploits analyses provided by the parsing software. Verb chains are recognised as cases in which verbs are assigned either finite main predicator or finite auxiliary predicator functional tags by Machinese Syntax.
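The defNP counting rule can be sketched over simplified dependency triples. Connexor's real output format differs, so the (dependent lemma, dependent tag, head tag) triples below are a hypothetical simplification used only to illustrate the counting rule itself:

```python
def count_definite_nps(dependencies):
    """defNP: count dependencies linking the lemmas 'the', 'this', or
    'that' to a token with a nominal surface syntactic category."""
    determiners = {"the", "this", "that"}
    return sum(1 for dep_lemma, _, head_tag in dependencies
               if dep_lemma in determiners and head_tag == "N")

# 'the' and 'this' attached to nominal heads count; indefinite 'a' does not:
deps = [("the", "DET", "N"), ("a", "DET", "N"), ("this", "DET", "N")]
print(count_definite_nps(deps))  # 2
```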

3.3.2. WordNet

Word sense ambiguity (Senses) was detected by exploitation of the WordNet ontology (Fellbaum, 1998). Input documents are first tokenised and each token is disambiguated in terms of its surface syntactic category by Machinese Syntax. The number of concepts linked to the word when used with that category is then obtained from WordNet. The extraction method thus exploits some limited word sense disambiguation as a result of the operation of the parser. As noted earlier (Section 3.2.), the feature Senses was calculated as the average number


Corpus      Kincaid  Flesch  Fog    SMOG   ch/w  syl/w  w/s
SimpleWiki  7.49     69.91   10.35   9.78  4.67  1.43   16.05
News        9.39     64.98   12.28  10.77  4.66  1.43   20.90
Health      7.84     69.31   10.83  10.07  4.63  1.42   17.13*
Fiction     5.05     83.06    7.85   7.90  4.29  1.30   13.58

Table 4: Readability indices and related features

of senses per word. Therefore, multiple occurrences of the same ambiguous word will increase the value of this feature.

4. Results

The study presented in this paper comprises three parts. In the first, a comparison is made between the values obtained for the four readability indices and the factors that they exploit (average numbers of characters and syllables per word, average number of words per sentence) in their assessment of the corpora (SimpleWiki, News, Health, and Fiction). If the intuitive assumption is valid that SimpleWiki represents a corpus of simplified texts (a "gold standard"), then this comparison will indicate how far documents from the news, health, and fiction genres (important for the social inclusion of people with ASD) lie from this "gold standard".

In the second part, the use of thirteen linguistic features is explored. Ten of the linguistic features are based on the frequency of occurrence of surface syntactic tags, one is based on sentence complexity expressed in terms of the number of verb chains that sentences contain, another provides an approximation of the number of definite noun phrases used in the text, and the final feature measures the average level of semantic ambiguity of the words used in the text. The values obtained for these features for each of the corpora are compared.

In the third part of the study, potential correlations between the linguistic features and readability metrics are investigated. The motivation for this lies in the fact that extraction of the linguistic features is relatively expensive and unreliable, while the computation of the readability metrics is done automatically and with greater accuracy. The ability to estimate the accessibility of documents for people with ASD on the basis of easily computed readability metrics rather than complex linguistic features would be of considerable benefit.

The results obtained in these three parts of the study are presented separately in the following sections.

4.1. Readability

The results of the first part of this study are presented in Table 4. The first row of the table contains the scores for these seven features obtained for SimpleWiki. For the other three text genres, the values of these features were calculated and a non-parametric statistical test (the Kolmogorov-Smirnov Z test) was applied in order to calculate the significance of the differences in means between SimpleWiki and the corresponding text genre.8

8 The Kolmogorov-Smirnov Z test was selected as a result of prior application of the Shapiro-Wilk W test, which

In Table 4, values which differ from those obtained for the documents in SimpleWiki at a 0.01 level of significance are printed in bold. Those printed in bold with an asterisk differ from those obtained from documents in SimpleWiki at a 0.05, but not at a 0.01, level of significance.

On the basis of these results it can be inferred that the news texts are the most difficult to read, as they require a higher level of literacy for their comprehension (the values of the Kincaid, Fog and SMOG indices are maximal for this genre, while the Flesch index is at its lowest level, indicating that all are in accordance). Discrepancies between the values of different indices are not surprising, as they use different variables and different criterion scores (DuBay, 2004). Also, it is known that the predictions made by these formulae are not perfect, but are rough estimates (r = .50 to .84) of text difficulty. That is, they "account for 50 to 84 percent of the variance in text difficulty as measured by comprehension tests" (DuBay, 2004). In the context of the current research, it is important that when the difficulty of two types of text is compared, consistent conclusions can be made about which type is more difficult than the other, regardless of which readability formula is used (Table 4).

It is interesting to note that none of the indices indicate significant differences between the readability of health documents and that of documents in SimpleWiki, suggesting that similar levels of literacy are necessary for their comprehension. Despite this, a slightly greater average sentence length was noted for the health texts than for the texts from SimpleWiki. The most surprising finding was that the fiction texts are reported by all readability metrics (including average word and sentence length) to be significantly less difficult than those from SimpleWiki (Table 4).
These results cast doubt on the assumption, made in previous work on text simplification (e.g. Coster and Kauchak, 2011), that SimpleWiki serves as a paradigm of accessibility.

As it was observed that all readability indices returned similar values in the comparison of different text genres (Table 4), the strength of correlation between them was investigated. To this end, Pearson's correlation was calculated between each pair of the four indices (Table 6), over the whole corpora (SimpleWiki, News, Health, and Fiction). Pearson's correlation is a bivariate measure of the strength of the relationship between two variables, which can vary from 0 (for a random relationship) to 1 (for a perfectly linear relationship) or -1 (for a perfectly negative linear relationship). The results presented in Table 6 indicate a very strong linear correlation between each pair of readability indices.
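Pearson's correlation as described above can be computed directly from its definition; a minimal sketch (statistics packages such as scipy provide the same computation together with significance values):

```python
import math

def pearson_r(x, y):
    """Pearson's r between two equal-length feature vectors,
    e.g. per-document Flesch and Fog scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two perfectly linearly, negatively related vectors give r = -1
# (cf. the strongly negative Flesch vs. Kincaid entry in Table 6):
print(round(pearson_r([1, 2, 3], [6, 4, 2]), 6))  # -1.0
```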

demonstrated that most of the features do not follow a normaldistribution.
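The two-sample Kolmogorov-Smirnov statistic used for the significance testing above is the maximum distance between the empirical distribution functions of the two samples. A minimal sketch of the statistic itself (the full test, including the p-values used in this study, is normally obtained from a statistics package rather than reimplemented):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D statistic: the largest absolute
    difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(1 for x in a if x <= v) / len(a)
        cdf_b = sum(1 for x in b if x <= v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Identical samples give D = 0; fully separated samples give D = 1:
print(ks_statistic([1, 2, 3], [1, 2, 3]), ks_statistic([0, 0], [5, 5]))  # 0.0 1.0
```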


Corpus      V     N     Prep  Det   Adv   Pron  A     CS    CC     INF    Compl  Senses  defNP
SimpleWiki  2.74  5.77  1.92  1.94  0.77  0.81  1.20  0.19  0.60   0.21   1.37   6.59    1.19
News        4.08  6.63  2.44  2.17  0.95  1.64  1.56  0.35  0.65*  0.38   0.53   6.73*   1.26*
Health      3.40  5.22  1.82  1.51  0.99  1.38  1.63  0.32  1.01   0.35   1.15   6.73    0.73
Fiction     2.95  3.33  1.43  1.35  1.10  1.89  0.90  0.23  0.49   0.23*  1.13*  7.59    0.77

Table 5: Linguistic features

r        Kincaid  Flesch  Fog    SMOG
Kincaid  1        −.959   .987   .951
Flesch   −.959    1       −.957  −.972
Fog      .987     −.957   1      .979
SMOG     .951     −.972   .979   1

Table 6: Pearson's correlation between readability indices

In the case of the Flesch index, the higher the score is, the lower the grade level necessary for understanding the given text. For all other indices, a higher score indicates a higher grade level necessary to understand the text. Therefore, the correlation between the Flesch index and any other index is always reported as negative. In order to confirm that these correlations are not genre dependent, a set of experiments was conducted to measure the correlation between these four readability measures separately for each of the four corpora (SimpleWiki, News, Health and Fiction). Those experiments revealed a very close correlation between the four readability indices (between .915 and .993) in each genre.

Figure 1: Distribution of the Flesch index

As the correlation between the four readability indices was reported to be almost perfectly linear (Table 6), the remainder of this study focuses on the Flesch index as a representative of the readability indices. The results discussed earlier (Table 4) presented only the mean value of the Flesch index for each of the corpora. In Figure 1, each text is represented separately, providing a more complete picture of the Flesch index distribution across the corpora. It can be noted that the mean value of the Flesch index lies at approximately the same place on the x-axis for

both SimpleWiki and Health texts, which is in accordance with the previously reported results (Table 4). The mean value of the Flesch index in the News genre is slightly shifted to the left relative to SimpleWiki, which corresponds to the lower text readability reported for the News genre in Table 4. It can also be noted that the distribution of the Flesch index in the Fiction genre is positioned significantly to the right relative to SimpleWiki, thus indicating a higher readability of texts in this genre.

4.2. Linguistic Features

The investigation of the average occurrence of ten different POS tags per sentence (V9, N, Prep, Det, Adv, Pron, A, CS, CC, INF) and three other linguistically motivated features (Compl, Senses and defNP) showed significantly different values in News, Health and Fiction than in SimpleWiki in most of the cases (Table 5).

Documents from SimpleWiki were found to contain the highest ratio of simple to complex sentences (Compl) and the lowest number of verbs (V), adverbs (Adv), pronouns (Pron), subordinating conjunctions (CS), infinitive markers (INF) and senses per word (Senses), which may reflect a certain simplicity of these texts. The News genre was found to contain the lowest ratio of simple to complex sentences (Compl), and the highest number of verbs (V), subordinating conjunctions (CS) and infinitive markers (INF) per sentence. These features indicate a greater number of verb chains (Compl) and subordinate clauses (CS), and longer verb chains and more complex verb constructions (V and INF), for news articles. These features can be considered indicators of syntactic complexity, which is probably reflected in the high scores for readability indices obtained in this genre (Table 4). The texts from the genre of fiction contained the smallest average number of nouns (N), prepositions (Prep), determiners (Det), adjectives (A) and coordinating conjunctions (CC) per sentence (Table 5). However, this genre contained a significantly higher number of senses per word (Senses) than the other genres.

4.3. Flesch vs. Linguistic Features

In the third part of the study, the potential correlation between the linguistic features and readability indices was investigated. The Flesch index was selected as a representative of the readability indices (as all four readability indices were almost perfectly linearly correlated, selection of an alternative readability index should not change the results significantly). Pearson's correlation between the investigated POS frequencies (on average per sentence)

9 This tag includes the occurrence of present (ING) and past participle (EN) forms.


Corpus       V      N      Prep   Det    Adv     Pron   A      CS     CC     INF
All          −.493  −.812  −.777  −.715  −.093*  .189   −.769  −.377  −.464  −.415
SimpleWiki   −.397  −.552  −.641  −.545  −.293   .136   −.685  −.130  −.424  −.118
News         −.385  −.738  −.759  −.705  −.197   .291   −.783  −.438  −.387  −.426
Health       −.274  −.743  −.607  −.489  −.104   .078   −.703  .014   −.610  −.139
Fiction      −.605  −.889  −.854  −.851  −.555   −.146  −.876  −.515  −.670  −.506

Table 7: Pearson's correlation between the Flesch readability index and POS frequencies

Corpus       Compl  ch/w   syl/w  w/s    Senses  defNP
All          .210   −.859  −.922  −.792  .627    −.595
SimpleWiki   .209   −.825  −.921  −.643  .452    −.337
News         −.026  −.866  −.919  −.762  .568    −.688
Health       .034   −.771  −.918  −.705  .417    −.450
Fiction      .376   −.790  −.877  −.822  .738    −.791

Table 8: Pearson's correlation between the Flesch readability index and other features

and the Flesch index is presented in Table 7, while the correlation between the other six features and the Flesch index is reported in Table 8. These experiments were conducted first for all the corpora together and then for each corpus separately, in order to determine whether these correlations are genre dependent.

As would be expected, the direction of the correlation (sign '−' or '+') is independent of genre (in those cases where the correlation is statistically significant and thus more reliable). However, the strength of the correlation does depend on the genre of the texts; e.g. the correlation between the average number of verbs per sentence (V) and the Flesch index is −.274 for the Health genre and −.605 for the Fiction genre. The '−' sign indicates that as the value of the feature increases, the Flesch index decreases (indicating a less readable text) and vice versa (as Pearson's correlation is a symmetric function, we cannot say in which direction the influence goes). The results presented in Tables 7 and 8 therefore indicate that for most of the features (V, N, Prep, Det, Adv, A, CS, CC, INF, ch/w, w/s, defNP), the lower the feature value for a given text, the easier that text is to read (the higher the Flesch index). For the feature Compl, the results also support the intuition that the higher the ratio of simple to complex sentences in a text, the more readable it is (the higher the Flesch index).

The most surprising results were those obtained for the feature Senses (Table 8), which indicate that the higher the average number of senses per word in a text, the more readable the text is. One possible hypothesis that emerges from this observation is that shorter words in English tend to be more semantically ambiguous than longer words (the readability indices are highly correlated with word length, measured both in characters and in syllables per word, with the occurrence of shorter words suggesting that a text is easier to read).
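As a concrete illustration of the computation behind Tables 7 and 8, Pearson's r can be calculated between per-document Flesch scores (Flesch Reading Ease = 206.835 − 1.015 · (words/sentence) − 84.6 · (syllables/word)) and a feature value such as average characters per word (ch/w). The four document values below are invented for illustration only; their strongly negative r simply mirrors the sign pattern of the ch/w column in Table 8.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-document values, for illustration only: Flesch Reading Ease
# scores paired with average characters per word (ch/w) for four documents.
flesch = [80.3, 65.1, 52.4, 40.0]
ch_per_w = [4.1, 4.6, 5.0, 5.5]

r = pearson(flesch, ch_per_w)  # strongly negative: longer words, lower Flesch
```

Because Pearson's r is symmetric in its arguments, this computation alone cannot distinguish which variable drives the other, as noted above.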

5. Conclusions

There are several important findings in this study. First, it was shown that the four well-known readability indices are almost perfectly linearly correlated on each of the four investigated text genres: SimpleWiki, News, Health, and Fiction. Furthermore, our results indicated that texts from the genre of fiction are simpler than those selected from SimpleWiki in terms of the readability indices, casting doubt on the assumption that SimpleWiki is a useful source of documents to form a gold standard of accessibility for people with reading difficulties. Application of the measures also indicated that news articles are the most difficult to read, relative to the other genres, requiring a higher level of literacy for their comprehension.

The results of the second part of our study (the investigation of various linguistic features) revealed that documents from SimpleWiki were the simplest of the four corpora in terms of several linguistic features: the average numbers of verbs, adverbs, pronouns, subordinating conjunctions and infinitive markers, the number of different word senses, and the ratio between simple and complex sentences. They also indicated some of the factors that may make news texts difficult to read, e.g. containing the highest numbers of verbs and subordinating conjunctions per sentence, and the lowest ratio of simple to complex sentences.

The results of the third set of experiments identified the average length of words (in characters and in syllables) as the features with the highest correlation to the Flesch index. They also indicated that features such as the average numbers of nouns, prepositions, determiners and adjectives are closely correlated with the Flesch index (up to .89 in the Fiction genre), which supports the idea of using readability indices as an initial measure of text complexity in our project. The comparison of these correlations across different text genres demonstrated that they are genre dependent and that the correlation between these linguistic features and the Flesch index is strongest for the Fiction genre.

6. Acknowledgements

The research described in this paper was partially funded by the European Commission under the Seventh Framework Programme for Research and Technological Development (FP7 2007-2013), project FIRST (287607). This publication reflects the views only of the authors, and the Commission cannot be held responsible for


any use which may be made of the information contained therein.

7. References

J. R. Bormuth. 1966. Readability: A new approach. Reading Research Quarterly, 1:79–132.

W. Bosma. 2005. Image retrieval supports multimedia authoring, pages 89–94. ITC-irst, Trento, Italy.

L. Burnard. 1995. Users Reference Guide British National Corpus Version 1.0. Oxford University Computing Services, UK.

N. Chomsky. 1986. Knowledge of Language: Its Nature, Origin, and Use. Greenwood Publishing Group, Santa Barbara, California.

E. B. Coleman. 1965. On Understanding Prose: Some Determiners of Its Complexity. National Science Foundation, Washington, D.C.

E. B. Coleman. 1971. Developing a Technology of Written Instruction: Some Determiners of the Complexity of Prose. Teachers College Press, Columbia University, New York.

W. Coster and D. Kauchak. 2011. Simple English Wikipedia: A new text simplification task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011), pages 665–669, Portland, Oregon, June. Association for Computational Linguistics.

C. C. Doak, L. G. Doak, and J. H. Root. 1996. Teaching Patients with Low Literacy Skills. J. B. Lippincott Company, Philadelphia.

W. H. DuBay. 2004. The Principles of Readability. Impact Information, Costa Mesa.

G. Escudero, L. Marquez, and G. Rigau. 2000. A comparison between supervised learning algorithms for word sense disambiguation. In C. Cardie, W. Daelemans, C. Nedellec, and E. Tjong Kim Sang, editors, Proceedings of the Fourth Computational Natural Language Learning Workshop, CoNLL-2000, pages 31–36, Lisbon, Portugal, September. Association for Computational Linguistics.

R. Evans. 2011. Comparing methods for the syntactic simplification of sentences in information extraction. Literary and Linguistic Computing, 26(4):371–388.

C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

R. Flesch. 1949. The Art of Readable Writing. Harper, New York.

U. Frith and M. Snowling. 1983. Reading for meaning and reading for sound in autistic and dyslexic children. Journal of Developmental Psychology, 1:329–342.

R. Gaizauskas, J. Foster, Y. Wilks, J. Arundel, P. Clough, and S. Piao. 2001. The Meter corpus: A corpus for analysing journalistic text reuse. In Proceedings of the Corpus Linguistics 2001 Conference, pages 214–223. Lancaster University Centre for Computer Corpus Research on Language.

W. S. Gray and B. Leary. 1935. What Makes a Book Readable. Chicago University Press, Chicago.

R. Gunning. 1952. The Technique of Clear Writing. McGraw-Hill, New York.

R. P. Kern. 2004. Usefulness of Readability Formulas for Achieving Army Readability Objectives: Research and State-of-the-Art Applied to the Army's Problem (NTIS No. AD A086 408/2). U.S. Army Research Institute, Fort Benjamin Harrison.

J. P. Kincaid, R. P. Fishburne, R. L. Rogers, and B. S. Chissom. 1986. Derivation of New Readability Formulas (Automatic Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. CNTECHTRA.

G. H. McLaughlin. 1969. SMOG grading – a new readability formula. Journal of Reading, 22:639–646.

N. Minshew and G. Goldstein. 1998. Autism as a disorder of complex information processing. Mental Retardation and Developmental Disability Research Review, 4:129–136.

R. Mitkov. 2002. Anaphora Resolution. Longman, Harlow, Essex.

K. Nation, P. Clarke, B. Wright, and C. Williams. 2006. Patterns of reading ability in children with autism spectrum disorder. Journal of Autism & Developmental Disorders, 36:911–919.

NCES. 1993. Adult Literacy in America. National Center for Education Statistics, U.S. Dept. of Education, Washington, D.C.

I. M. O'Connor and P. D. Klein. 2004. Exploration of strategies for facilitating the reading comprehension of high-functioning students with autism spectrum disorders. Journal of Autism and Developmental Disorders, 34(2):115–127.

C. Orasan and L. Hasler. 2007. Computer-aided summarisation: how much does it really help? In Proceedings of Recent Advances in Natural Language Processing (RANLP 2007), pages 437–444, Borovets, Bulgaria, September.

A. Siddharthan. 2006. Syntactic simplification and text cohesion. Research on Language and Computation, 4(1):77–109.

P. Tapanainen and T. Jarvinen. 1997. A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing, pages 64–71. Association for Computational Linguistics.

B. D. Weiss and C. Coyne. 1997. The use of user modelling to guide inference and learning. New England Journal of Medicine, 337(4):272–274.

M. Yatskar, B. Pang, C. Danescu-Niculescu-Mizil, and L. Lee. 2010. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pages 365–368, Los Angeles, California, June. Association for Computational Linguistics.