
ORIGINAL PAPER

Question answering at the Cross-Language Evaluation Forum 2003–2010

Anselmo Peñas · Bernardo Magnini · Pamela Forner · Richard Sutcliffe · Álvaro Rodrigo · Danilo Giampiccolo

© Springer Science+Business Media B.V. 2012

Abstract The paper offers an overview of the key issues raised during the 8 years' activity of the Multilingual Question Answering Track at the Cross Language Evaluation Forum (CLEF). The general aim of the track has been to test both monolingual and cross-language Question Answering (QA) systems that process queries and documents in several European languages, also drawing attention to a number of challenging issues for research in multilingual QA. The paper gives a brief description of how the task has evolved over the years and of the way in which the data sets have been created, presenting also a short summary of the different types of questions developed. The document collections adopted in the competitions are outlined as well, and data about participation is provided. Moreover, the main measures used to evaluate system performance are explained and an overall analysis of the results achieved is presented.

Keywords Question answering · Evaluation · CLEF

A. Peñas · Á. Rodrigo
NLP & IR Group, UNED, Madrid, Spain
e-mail: [email protected]

Á. Rodrigo
e-mail: [email protected]

B. Magnini
Fondazione Bruno Kessler (FBK-irst), Trento, Italy
e-mail: [email protected]

P. Forner (corresponding author) · D. Giampiccolo
Center for the Evaluation of Language and Communication Technologies (CELCT), Trento, Italy
e-mail: [email protected]

D. Giampiccolo
e-mail: [email protected]

R. Sutcliffe
University of Limerick, Limerick, Ireland
e-mail: [email protected]

Lang Resources & Evaluation. DOI 10.1007/s10579-012-9177-0

1 Introduction

Research in Question Answering (QA) received a strong boost under the promotion of the TREC-8 (Voorhees and Tice 1999) and TREC-9 (Voorhees 2000) Question Answering tracks. The aim of the TREC QA campaigns was to assess the capability of systems to return exact answers to open-domain English questions. The QA track at TREC represented the first attempt to emphasise the importance of, and foster research on, systems that could extract relevant and precise information rather than documents. QA systems are designed to find answers to open-domain questions in a large collection of documents, and the development of such systems has acquired an important status in the scientific community because it entails research in both Natural Language Processing (NLP) and Information Retrieval (IR), putting the two disciplines in contact. In contrast to the IR scenario, a QA system processes questions formulated in natural language (instead of keyword-based queries) and retrieves answers (instead of documents). Over the years at TREC, from 1999 to 2007, and under the TAC conference in 2008, the task evolved, providing advancements and evaluation evidence for a number of key aspects in QA, including answering factual and definition questions, questions requiring complex analysis, follow-up questions in a dialog-like context, and mining answers from different text genres, including blogs. However, despite the great deal of attention that QA received at TREC, multilinguality remained outside the mainstream of QA research.

Multilingual QA emerged as a complementary research task, representing a promising direction for at least two reasons. First, it allowed users to interact with machines in their native languages, contributing to easier, faster, and more equal information access. Second, cross-lingual capabilities enabled QA systems to access information stored only in language-specific text collections.

Since 2003, a multilingual question answering track has been carried out at the Cross-Language Evaluation Forum (CLEF).1 The introduction of multilinguality represented not only a great novelty in the QA research field, but also a good chance to stimulate the QA community to develop and evaluate multilingual systems. Over the years, the organisers' effort focused on two main issues. One aim was to offer an evaluation exercise characterised by cross-linguality, covering as many languages as possible. From this perspective, major attention was given to European languages, adding at least one new language each year. However, the offer was also kept open to languages from all over the world, as the inclusion of Indonesian shows.

1 http://www.clef-campaign.org.


The other important issue was to maintain a balance between the established procedure—inherited from the TREC campaigns—and innovation. This allowed newcomers to join the competition and, at the same time, offered "veterans" more challenges.

This paper is organised as follows: Sect. 2 outlines the outcomes and the lessons learned in 8 years of CLEF campaigns; Sect. 3 gives a brief description of how the task has evolved over the years and of the way in which the data sets were created, and presents the document collections adopted and data about participation. Section 4 gives a short explanation of the different measures adopted to evaluate system performance. In Sect. 5, annual results are discussed, highlighting some important features. In Sect. 6 the main techniques adopted by participants are described. Section 7 addresses some relevant research directions in QA which have been explored in recent years outside the scope of QA at CLEF. Finally, in Sect. 8 some conclusions are drawn. The "Appendix" also gives a brief overview of the different types of questions developed.

2 Outcomes and lessons learned

The main outcomes of the Question Answering Track over these years (2003–2010) are:

1. Development of reusable benchmarks in several languages. Although it is not possible to compare different systems across languages, developers can compare their own systems across languages thanks to the use of comparable and parallel document collections, and parallel translations of all questions into many different languages.

2. Development of the methodologies for creating these multilingual benchmarks.

3. Diversity of types of questions (all of them classified in the available resources) and diversity of collections (from newswire or Wikipedia to legislative texts).

4. A general methodology for QA evaluation. This methodology has evolved thanks to the output generated by the many pilot exercises attached to the track.

During these years, some lessons attached to the goals of each particular campaign have been learned. From the very beginning in 2003, the track had a strong focus on multilinguality and tried to promote the development of translingual systems. Despite all the efforts made in this direction—translating questions into many different languages and using comparable and parallel corpora—systems targeting different languages cannot be strictly compared and no definite conclusions can be drawn. Nevertheless, the resources developed allow the comparison of the same system across different languages, which is very important for QA developers who work in several languages, as the performances of different systems targeting the same language can be assessed comparatively.


The final methodology was implemented in 2009 and 2010 (Peñas et al. 2009, 2010), when both questions and documents had parallel translations. Thus, the systems that participated in several languages served as reference points for comparison across languages.

Another lesson learned concerned how the evaluation setting determines the architecture of participating systems. By 2005 it became clear that there was an upper bound of 60% accuracy in system performance, although more than 80% of the questions were answered by at least one participant. It emerged that there was a problem of error propagation in the most commonly used QA pipeline (Question Analysis, Retrieval, Answer Extraction, Answer Selection/Validation). Thus, in 2006 a pilot task called the Answer Validation Exercise (AVE) was proposed, aimed at fostering a change in QA architectures by giving more relevance to the validation step (Peñas et al. 2006). In AVE, the assumption was that after a preliminary step of hypothesis over-generation, the validation step decides whether each candidate answer is correct or not. This is a kind of classification task that could take advantage of Machine Learning. The same idea is behind the architecture of IBM's Watson (DeepQA project), which successfully participated in Jeopardy! (Ferrucci et al. 2010).
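The contrast between the two designs can be made concrete with a small sketch. The following Python toy pipeline is illustrative only: the retrieval, extraction and validation heuristics (retrieve_passages, extract_candidates, validate, the 0.3 confidence threshold) are assumptions made for the example, not the components of any CLEF participant or of Watson; the validate function merely stands in for the kind of learned answer-validation classifier promoted by AVE.

```python
from typing import List, Optional, Tuple

def retrieve_passages(question: str, collection: List[str], k: int = 3) -> List[str]:
    # Toy retrieval: rank passages by word overlap with the question.
    q_terms = set(question.lower().split())
    ranked = sorted(collection, key=lambda p: -len(q_terms & set(p.lower().split())))
    return ranked[:k]

def extract_candidates(question: str, passages: List[str]) -> List[str]:
    # Toy over-generation: every capitalised token not already in the question is a candidate.
    candidates = []
    for p in passages:
        for tok in p.split():
            tok = tok.strip(".,;:!?")
            if tok[:1].isupper() and tok.lower() not in question.lower():
                candidates.append(tok)
    return candidates

def validate(question: str, candidate: str, passages: List[str]) -> float:
    # Stand-in for a learned answer-validation classifier (cf. AVE): confidence is the
    # fraction of retrieved passages containing the candidate and sharing terms with the question.
    q_terms = set(question.lower().split())
    hits = sum(1 for p in passages
               if candidate.lower() in p.lower() and q_terms & set(p.lower().split()))
    return hits / max(1, len(passages))

def answer(question: str, collection: List[str], threshold: float = 0.3) -> Optional[Tuple[str, float]]:
    passages = retrieve_passages(question, collection)
    candidates = extract_candidates(question, passages)
    if not candidates:
        return None  # leave the question unanswered
    best = max(candidates, key=lambda c: validate(question, c, passages))
    confidence = validate(question, best, passages)
    # Leaving low-confidence questions unanswered is the behaviour later rewarded by c@1 (Sect. 4).
    return (best, confidence) if confidence >= threshold else None

if __name__ == "__main__":
    docs = ["Rome is the capital of Italy.", "Italy joined the EU in 1958.", "Paris is in France."]
    print(answer("What is the capital of Italy?", docs))  # e.g. ('Rome', 0.33...)
```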

After the three campaigns of AVE, an attempt was made to transfer the conclusions to the QA main task at CLEF 2009 and 2010. The first step was to introduce the option of leaving questions unanswered, which is related to the validation technologies necessary to develop better QA systems. A suitable measure was also needed, able to reward systems that reduce the number of questions answered incorrectly without affecting accuracy, by leaving unanswered those questions whose answers the system is not confident about. The measure was an extension of accuracy called c@1 (Peñas and Rodrigo 2011), tested during the 2009 and 2010 QA campaigns at CLEF and also used in subsequent evaluations.

However, this did not produce the expected change in architecture, as almost all systems continued using indexing techniques to retrieve relevant passages and tried to extract the exact answer from them. Moreover, results did not go beyond the 60% pipeline upper bound.

Therefore, the conclusion was that, in order to foster a real change in QA system architecture, the prior development of answer validation/selection technologies was required. For this reason, the new formulation of the task after 2010 leaves the retrieval step aside to focus on the development of technologies able to work with a single document, answering questions about it and using the reference collections as sources of background knowledge that help the answering process.

3 Track evolution

From 2003 up to 2010, the QA task consisted of taking a short question and a document collection as input and producing an exact answer as output. In the Multiple Language QA Main Task at CLEF, the systems were fed with a set of questions and were asked to return one or more exact answers per question, where exact means that neither more nor less than the required information must be returned.


In all the campaigns, the QA track was structured in both monolingual and bilingual tasks. The success of the track showed an increasing interest both in monolingual non-English QA—where questions and answers were in the same language—and in cross-lingual QA—where the question was posed in one language and the answer had to be found in a collection in a different language. Until 2009, the target collections consisted of newspaper articles, which were comparable but not parallel; as a consequence, the answer might be present in more than one language collection, though not necessarily in all. By contrast, in the 2009 and 2010 campaigns a parallel aligned corpus was used, which made the task completely multilingual, i.e. questions had an answer in all target languages.

Tables 1 and 2 summarise all the novelties that have been introduced in the main task over the years, in order to make the exercise more challenging and realistic.

3.1 Task and question types

In 2003 (Magnini et al. 2003), the task consisted of returning automatically—i.e. with no manual intervention—a ranked list of [docid, answer] pairs per question such that the retrieved document supported the answer. Participants were given 200 questions for each language sub-task, and were allowed to submit up to three responses per query. They were asked to retrieve either a 50-byte snippet of text extracted from the document collections, which provided exactly the amount of information required, or an exact answer. Each returned run consisted either entirely of 50-byte answers or of exact answers, but not a mixture. Twenty questions had no known answer in the target corpora: systems indicated their confidence that there was no answer in the document collection by returning "NIL" instead of the [docid, answer] pair. There was general agreement about the usefulness of such questions in assessing systems' performance, so a certain number of NIL questions were created in all QA campaigns until 2008. In the first year of the track, only Factoid questions were considered, i.e. fact-based questions asking for the name of a person, a location, the extent of something, the day on which something happened, etc. Participants were not required to return a supporting context for their answers until 2006. The "Appendix" includes examples showing contexts (with the document ID in brackets) to illustrate the source of the answer given for all the different question types used over these years (see summary in Table 2).

In 2004 (Magnini et al. 2004), the task was repeated without changes but for the addition of four new languages and two new question types: Definition, and a new answer type for Factoid, namely Manner.

Despite the demand for radical innovation, a conservative approach was preferred in 2005 (Vallin et al. 2005), as the procedures consolidated in the previous two campaigns seemed to need further investigation before moving to the next stage. Although the task remained basically the same as that of 2004, some minor changes were made: the question types Manner and Object were discontinued and, at the same time, the concept of Temporal Restriction was introduced. This was the property of restricting answers to a given question (of any type) to those which were valid only when associated with an event, when occurring on a particular date, or when taking place within a time interval. Temporal restrictions have since been used in a subset of CLEF questions in all years up until the present.

Table 1 Collections, questions and answer styles at CLEF campaigns

2003: 1994 newspapers; 3 target languages; 200 questions; 50-byte or exact answers; NIL answers; supporting document.
2004: 1994 and 1995 newspapers; 7 target languages; 200 questions; exact answers; NIL answers; supporting document.
2005: 1994 and 1995 newspapers; 8 target languages; 200 questions; exact answers; NIL answers; supporting document.
2006: 1994 and 1995 newspapers; 9 target languages; 200 questions; exact answers; NIL answers; supporting snippet.
2007: 1994 and 1995 newspapers plus 2006 Wikipedias; 10 target languages; 200 questions, grouped; exact answers; NIL answers; supporting snippet.
2008: 1994 and 1995 newspapers plus 2006 Wikipedias; 11 target languages; 200 questions, grouped; exact answers; NIL answers; supporting snippet.
2009: JRC-Acquis; 9 target languages; 500 questions; paragraph answers.
2010: JRC-Acquis and EUROPARL; 7 target languages; 200 questions; paragraph or exact answers; supporting paragraph.

Table 2 Question types at CLEF campaigns

Question types used across the campaigns: Count, Definition, List, Location, Manner, Measure, Object, Opinion, Organisation, Other, Person, Procedure, Purpose, Reason, Reason-Purpose, Time, and Temporal restriction (see Sect. 3.1 and the Appendix for the years in which each type was used).

In 2006 (Magnini et al. 2006), the most significant innovation was the introduction of List questions, which had also been considered for previous competitions but had been avoided due to the issues that their selection and assessment implied. In contrast to TREC, where each answer was listed as a separate, self-contained response to the question, at CLEF the list was contained within a single response; this means that the answer was found in one passage of the document set that spelled out the entire list. In this respect, these single-response List questions did not differ from a traditional Factoid question. Moreover, such questions could require either "closed lists" as answers, consisting of a number of specified items, or "open lists", where an unspecified number of correct answers could be returned. In the case of closed lists, correct but partial answers, where only some of the expected items were present, were evaluated as inexact. This kind of question was introduced in order to allow a multilingual investigation of List questions without requiring a separate evaluation procedure.

Other important innovations of the 2006 campaign were the possibility to return up to ten exact answers per question, and the requirement to additionally provide up to ten text snippets—i.e. substrings of the specified documents giving the actual context of the exact answer—in order to justify it.

In 2007, the questions were grouped into clusters, each of which referred to the same topic. This meant that co-reference could be used between entities mentioned in questions—a well-known phenomenon within NLP which nevertheless had not been considered in previous QA exercises at CLEF. In these cases, the supporting document for the second answer might not be the same as that for the first answer.

Another major novelty for 2007 concerned the documents. Up to 2006, each data collection comprised a set of newspaper articles provided by ELRA/ELDA (see Table 3). Then, in 2007, Wikipedia dated 2006 was used as well, capitalising on the experience of the WiQA pilot task (Jijkoun and de Rijke 2007). Thus, for example, the answer to a question in French could be found in a French newspaper article (as in previous years), in a French Wikipedia entry, or both. One of the main reasons for using the Wikipedia collections was to make a first step towards Web-formatted corpora; as a huge amount of information was available on the Web, this was considered a desirable next level in the evolution of QA systems.

The 2007 task proved to be much more difficult than expected because of the grouped questions. Not only did groups include co-reference (see Example 9 in the "Appendix") but, in addition, the questions became intrinsically more complicated because they were no longer semantically self-contained, as the simple factoids of earlier campaigns had been. Instead, they effectively developed a theme cumulatively. In order to allow participants more time to study this problem further, the exercise was repeated almost without changes in 2008.

The 2009 evaluation track, called ResPubliQA, represented a radical change with respect to the previous QA campaigns at CLEF. The exercise was aimed at retrieving answers to a set of 500 questions. The required output was not an exact answer but an entire paragraph, and the collection—JRC-Acquis—was from a specific domain, i.e. European legislation. Moreover, three new question types were introduced, in an attempt to move away from the factoid paradigm: Procedure, Purpose and Reason. Finally, the choice of a specific domain represented a first step towards the definition of a realistic user model. The issue of identifying potential users of QA systems had long been a matter of discussion among the track organizers, but in the campaigns held so far the focus had been on proposing a general task in order to allow systems to perfect existing techniques. In 2009, the time seemed ripe to make the task more realistic and introduce a user model. While looking for a suitable context, improving the efficacy of legal searches in the real world seemed an approachable field, as the retrieval of information from legal texts was an issue of increasing importance given the vast amount of data which had become available in electronic form in the previous years.

The design of the ResPubliQA 2010 evaluation campaign was to a large extent a repetition of the previous year's exercise. However, this year participants had the opportunity to return both paragraphs and exact answers as system output. Another novelty was the addition of a portion of the EuroParl collection, which contains transcribed speeches from the European Parliament. Moreover, Reason and Purpose questions, which had been found to be too similar to one another, were merged into one category, Reason-Purpose. At the same time, two new question types were introduced, Other and Opinion. In the case of the latter, it was thought that speeches within EuroParl might express interesting opinions.

3.2 Multilingual question sets

The procedure for generating questions did not change significantly over the years. For each target language, a number of questions (ranging from 100 to 200 depending on the campaign) were manually produced, initially using the topics of the Ad-Hoc track at CLEF. The use of topics was originally introduced to reduce the number of duplicates in the multilingual question set. Together with the questions, a Gold Standard was also produced by manually searching for at least one answer in a document collection. The questions were then translated into English, which acted as lingua franca, so that they could be understood and reused by all the other groups. Once the questions were collected in a common format, native speakers of each source language, with a good command of English, were recruited to translate the English version of all questions into their own languages, trying to adhere as closely as possible to the original.

The introduction of back translation to create cross-lingual question–answer pairs—a paradigm developed in 2003 and used ever since—is one of the most remarkable features of QA at CLEF.

In 2007 (Giampiccolo et al. 2007), with the introduction of topic-related questions, the procedure followed to prepare the test set changed slightly. First of all, each organising group, responsible for a target language, freely chose a number of topics. For each topic, one to four questions were generated. The topic-related questions consisted of clusters of questions which referred to the same topic. The requirement for related questions on a topic necessarily implies that the questions refer to common concepts and entities within the domain in question. Unlike in the previous campaigns, topics could be not only named entities or events, but also other categories such as objects, natural phenomena, etc. Topics were not given in the test set, but could be inferred from the first question/answer pair.

Table 3 Document collections used in the CLEF campaigns (each x marks one campaign, 2003–2010, in which the collection was used)

Sega [BG] (2002): x x x x
Standart [BG] (2002): x x x x
Novinar [BG] (2002): x
Frankfurter Rundschau [DE] (1994): x x x x x
Der Spiegel [DE] (1994–1995): x x x x x
German SDA [DE] (1994): x x x x x
German SDA [DE] (1995): x x x x x
The Southeast European Times [EL] (2002): x
Los Angeles Times [EN] (1994): x x x x x x
Glasgow Herald [EN] (1995): x x x x x
EFE [ES] (1994): x x x x x x
EFE [ES] (1995): x x x x x
Egunkaria [EU] (2001–2003): x
Aamulehti [FI] (1994–1995): x
Le Monde [FR] (1994): x x x x x
Le Monde [FR] (1995): x x x
French SDA [FR] (1994): x x x x x
French SDA [FR] (1995): x x x x x
La Stampa [IT] (1994): x x x x x x
Italian SDA [IT] (1994): x x x x x x
Italian SDA [IT] (1995): x x x x x
NRC Handelsblad [NL] (1994–1995): x x x x x x
Algemeen Dagblad [NL] (1994–1995): x x x x x x
Publico [PT] (1994): x x x x x
Publico [PT] (1995): x x x x x
Folha de Sao Paulo [PT] (1994): x x x
Folha de Sao Paulo [PT] (1995): x x x
Wikipedia (BG) (DE) (EN) (ES) (FR) (IT) (NL) (PT) (RO) (Nov. 2006): x x
Subset of JRC-Acquis (BG) (DE) (EN) (ES) (FR) (IT) (PT) (RO): x x
Subset of Europarl (DE) (EN) (ES) (FR) (IT) (PT) (RO): x

For the ResPubliQA tasks in 2009 and 2010, the questions were once again ungrouped. The collection was also changed (see next section) but the same principle of back-translation was used to create a completely parallel set of questions, identical in all source languages.

3.3 Document collections

Before 2009, the target corpora in all languages, released by ELRA/ELDA, consisted of large, unstructured, open-domain text collections. The texts were SGML-tagged and each document had a unique identifier (docid) that systems had to return together with the answer, in order to support it. The sources of these collections remained practically unchanged over the years. Table 3 gives an overview of all the collections used in the QA campaigns.

In the first QA exercise, where only three languages were considered, the collections were taken from news of 1994 and 1995. In the following year, the number of languages increased and new collections from news sources were added for each language, all covering the same time span, i.e. 1994–1995.

On the one hand, the fact that the newspaper and news agency articles referred to the same period of time, with the exception of Bulgarian, ensured that a certain number of topics in the documents were shared across the different collections, making them comparable, at least to some degree. On the other hand, the collections were not really homogeneous and, more importantly, were of different sizes, ranging from a minimum of 69,195 documents (213 MB) for Bulgarian to 454,045 documents (1,086 MB) for Spanish, which implied that the systems had to deal with considerably different amounts of data depending on the language of the task they had to perform.

To reduce the difference between collections and improve the comparability of systems' performances, the necessity of adopting other collections was debated for a long time, but copyright issues represented a major obstacle.

A step towards a possible solution was made with the proposal of the WiQA pilot task, which represented a first attempt to set the QA competitions in their natural context, i.e. the Internet. An important advantage of Wikipedia was that it was freely available in all the languages considered, and presented a fairly high number of entries containing comparable information. As this new source of data appeared to be a promising field to explore in the attempt to gain larger comparability among languages, Wikipedia corpora were added in 2007. "Snapshots" of Wikipedia pages for each language, as found in the November 2006 version, were made available for download in both XML and HTML versions. However, the significant variations in the size of the Wikipedia data across languages still represented a major shortcoming, as the misalignment of the information about the same topic made it difficult to create questions which could have answers in all the languages of the competition, and to balance questions by type across languages.


A final approach to the problem of data comparability was attempted in 2009, when a subset of the JRC-Acquis Multilingual Parallel Corpus was used. JRC-Acquis2 is a freely available parallel corpus of European Union (EU) documents, mostly of a legal nature, covering various subject domains such as economy, health, information technology, law, agriculture, food, and politics. This collection of legislative documents offered the opportunity to test QA systems on the same set of questions in all the languages—allowing a real comparison of performances—and represented a change from the news domain to the legal domain.

As the ResPubliQA task was repeated in 2010, a subset of JRC-Acquis was used again, together with a subset of EuroParl3—a collection of the Proceedings of the European Parliament dating back to 1996—in order to ensure a wider variety of questions and make the exercise more challenging.

3.4 Participation and languages involved

The first years of the QA evaluation exercises at CLEF registered a steady increase not only in the number of participants, but also in the number of languages involved, which is encouraging, as multilinguality is one of the main aims of these exercises. From 2007 on, the number of participants started to decrease, presumably because the task underwent major modifications which made it more challenging. Nevertheless, the number of languages involved in the exercise remained stable, as new languages were added to replace others which were no longer adopted. It is worth noting that participants seemed to be less and less inclined to carry out cross-lingual tasks, especially in the last two campaigns. Table 4 gives an overview of participation, languages and runs, showing at a glance how the exercise evolved during 8 years of QA campaigns.

When the track was proposed for the first time in 2003, eight tasks were set up—three monolingual and five bilingual—and eight groups from Europe and North America participated in seven of them. The details of the distribution between monolingual and bilingual tasks in all QA campaigns are shown in Table 5.

In 2004, the CLEF QA community grew significantly, as the spectrum of languages widened. In fact, nine source languages and seven target languages were exploited to set up more than 50 tasks, both monolingual and bilingual. The monolingual English task was not offered, as it appeared to have been sufficiently investigated at TREC, a policy retained in the following campaigns until 2009. The response of the participants was very positive, and eighteen groups—twice as many as in the previous year—tested their systems in the QA exercise, submitting a total of 48 runs.

In 2005, the positive trend in terms of participation was confirmed, as the number of participants rose to 24, and 67 runs were submitted. The addition of Indonesian introduced for the first time a non-European language into the task, enhancing the multilingual character of the exercise and experimenting with cross-linguality involving languages outside the European boundaries.

2 http://wt.jrc.it/lt/Acquis/.
3 http://www.europarl.europa.eu.


In 2006, there were 30 participants, more than in any year before or since. Eleven languages were considered both as source and as target, except for Indonesian, Polish and Romanian, which had no corpus against which to address the questions. In these cases, cross-language tasks were activated with English as the target language, by translating the set of questions from English, used as lingua franca, into Indonesian, Romanian and Polish. In the end, 24 tasks were proposed, divided into 7 monolingual and 17 cross-lingual tasks.

After years of constant growth, the number of participants decreased in 2007, probably due to the new challenges introduced in the exercise, which may have discouraged some potential participants. The language setting was the same as in the previous year, except for Polish, which was not considered in this campaign. Eight monolingual and 29 cross-lingual tasks were enabled. Unfortunately, the number of submitted runs declined significantly, decreasing from a total of 77 registered in the previous campaign to 37.

In 2008, two new European languages, Greek and Basque, were added to the source languages considered in the previous year, while Indonesian was discontinued. Ten monolingual and 33 bilingual tasks were set up. Although the number of participants remained almost the same as in 2007, the number of submitted runs increased from 37 to 51.

The 2009 campaign involved experiments with a new document collection and a new domain, and participation decreased further, probably due to the new challenges introduced. The languages considered were the same as in the previous year, except for Greek, which was not proposed again. All the combinations between languages were enabled except for Basque, which was exploited only as a source language. Moreover, the monolingual English task, traditionally not included in the exercise, was also proposed. Eleven groups participated, submitting 28 runs, all of which were monolingual, with the exception of two Basque–English runs. This is probably due to the configuration of the task: the fact that the sets of questions had answers in each of the parallel-aligned collections did not motivate participants to search for a response in a language different from that of the question.

Table 4 Statistics about the QA at CLEF campaigns over the years

Year | Participants | Submitted runs | Monolingual runs | Cross-lingual runs | Activated tasks | Tasks chosen by at least 1 participant | Target languages
2003 | 8 | 17 | 6 | 11 | 8 | 7 | 3
2004 | 18 (+125%) | 48 | 20 | 28 | 56 | 19 | 7
2005 | 24 (+33.33%) | 67 | 43 | 24 | 81 | 23 | 8
2006 | 30 (+20%) | 77 | 42 | 35 | 24 | 24 | 9
2007 | 22 (-26.67%) | 37 | 20 | 17 | 37 | 18 | 10
2008 | 21 (-4.76%) | 51 | 31 | 20 | 43 | 20 | 11
2009 | 11 (-47.62%) | 28 | 26 | 2 | 110 | 7 | 10
2010 | 13 (+18.18%) | 49 | 45 | 4 | 50 | 9 | 7


Table 5 Languages at QA@CLEF

2003 — Monolingual runs: IT(2), NL(2), ES(2). Cross-lingual runs: ES-EN(2), FR-EN(6), IT-EN(2), DE-EN(1).
2004 — Monolingual runs: DE(1), ES(8), FR(2), IT(3), NL(2), PT(3). Cross-lingual runs: BG-EN(1), BG-FR(2), EN-FR(2), EN-NL(1), ES-FR(2), DE-EN(3), DE-FR(2), FI-EN(1), FR-EN(6), IT-EN(2), IT-FR(2), NL-FR(2), PT-FR(2).
2005 — Monolingual runs: BG(2), DE(3), ES(13), FI(2), FR(10), IT(6), NL(3), PT(4). Cross-lingual runs: BG-EN(1), DE-EN(1), EN-DE(3), EN-ES(3), EN-FR(1), EN-PT(1), ES-EN(1), FI-EN(2), FR-EN(4), IN-EN(1), IT-EN(2), IT-ES(2), IT-FR(2), PT-FR(1).
2006 — Monolingual runs: BG(3), DE(6), ES(12), FR(8), IT(3), NL(3), PT(7). Cross-lingual runs: EN-DE(2), EN-ES(3), EN-FR(6), EN-IT(2), EN-NL(3), EN-PT(3), ES-EN(3), FR-EN(4), FR-ES(1), DE-EN(1), ES-PT(1), IT-EN(1), PT-ES(1), RO-EN(2), IN-EN(1), PL-EN(1), PT-FR(1).
2007 — Monolingual runs: DE(3), ES(5), FR(1), IT(1), NL(2), PT(7), RO(3). Cross-lingual runs: DE-EN(1), EN-DE(1), EN-FR(1), EN-NL(2), EN-PT(1), ES-EN(1), FR-EN(2), IN-EN(1), NL-EN(2), PT-DE(1), RO-EN(1).
2008 — Monolingual runs: BG(1), DE(6), ES(6), EU(1), FR(1), NL(2), PT(9), RO(4). Cross-lingual runs: DE-EN(3), EN-DE(3), EN-EU(1), EN-ES(2), EN-FR(1), EN-NL(2), ES-DE(2), ES-EU(2), FR-ES(1), NL-EN(1), PT-FR(1), RO-EN(1).
2009 — Monolingual runs: DE(2), EN(10), ES(6), FR(3), IT(1), RO(4). Cross-lingual runs: EU-EN(2).
2010 — Monolingual runs: DE-4,0; EN-16,3; ES-6,1; FR-5,2; IT-2,1; PT-1,0; RO-4,0. Cross-lingual runs: EN-RO-2,0; EU-EN-2,0.

Number of runs for each monolingual language and for each cross-lingual language pair.


In 2010 the exercise of the previous campaign was replicated almost identically, considering the same languages with the exception of Bulgarian. A slight increase in participation was registered, passing from 11 to 13 participants, who submitted 49 runs, almost twice as many as in the previous year. The preference for monolingual tasks was confirmed, as only two participating teams attempted cross-lingual tasks, namely Basque–English and English–Romanian. This trend, and the fact that 22 out of 49 submitted runs were monolingual English, suggests that multilinguality was not perceived as a priority in the last two campaigns.

3.5 Pilot exercises

QA at CLEF was also an opportunity to experiment with several pilot tasks, as Table 6 shows, whose common goal was to investigate how QA systems and technologies are able to cope with types of questions different from those proposed in the main task, experimenting with different scenarios. The following pilot tasks have been proposed over the years:

Table 6 Pilot tasks at QA at CLEF campaigns over the years

2006: Answer Validation Exercise (AVE); Extraction of Novel Wikipedia Information (WiQA); Real Time QA
2007: Answer Validation Exercise (AVE); QA over Speech Transcriptions (QAST)
2008: Answer Validation Exercise (AVE); QA over Speech Transcriptions (QAST); Word Sense Disambiguation QA (WS-QA)
2009: Geographic Wikipedia IR (GikiCLEF); QA over Speech Transcriptions (QAST)

• Real Time Question Answering (Noguera et al. 2007): an exercise for the evaluation of QA systems within a time constraint, carried out in the 2006 campaign, proposing new measures which combine precision with answer time.

• Answer Validation (Peñas et al. 2006): a voluntary exercise to promote the development and evaluation of sub-systems aimed at validating the correctness of the answers given by a QA system. The basic idea was that once an [answer + snippet] pair is returned for a question by a QA system, an Answer Validation module has to decide whether the answer is correct according to the supporting snippet.

• Question Answering over Speech Transcripts (Lamel et al. 2007): the aim was to evaluate QA technology in a real multilingual speech scenario in which written and oral questions (factual and definitional) in different languages were formulated against a set of audio recordings related to speech events in those languages. The scenario was the European Parliament sessions in English, Spanish and French.

• Word Sense Disambiguation for Question Answering (see Forner et al. 2008, Section 3.6): a pilot task which provided the questions and collections with already disambiguated word senses in order to study their contribution to QA performance.

• Question Answering using Wikipedia (Jijkoun and de Rijke 2007): the purpose was to see how IR and NLP techniques could be effectively used to help readers and authors of Wikipedia pages access information spread throughout Wikipedia rather than stored locally on the pages. Specifically, the task involved detecting whether a snippet contained new information or whether it duplicated what was already known.

• GikiCLEF (Santos and Cabral 2009): following the previous GikiP pilot at GeoCLEF 2008, the task focused on open list questions over Wikipedia that require geographic reasoning, complex information extraction, and cross-lingual processing, for Bulgarian, Dutch, English, German, Italian, Norwegian, Portuguese, Romanian and Spanish.

4 Performance assessment

The evaluation performed for a specific QA track depended on the concrete objectives of each year: once these were set, the organisers tried to choose an appropriate evaluation method. This implied determining specific features of the collections, as well as selecting the measures for assessing the performance of participating systems.

For each question in the test set, systems were required to return at least one answer along with a text supporting the correctness of that answer. Until 2005, the supporting information was the id of a document, while starting from 2006 systems had to return a supporting snippet (no more than 500 bytes) containing the answer. Answers were judged by native-language human assessors, who assigned to each response a unique judgment following the schema already established at TREC (Voorhees 2000):

• Right (R): the answer string consisted of nothing more than an exact answer and it was supported by the accompanying text;

• Wrong (W): the answer string did not contain a correct answer;

• Unsupported (U): the answer was correct, but it was impossible to infer its correctness from the supporting text;

• IneXact (X): the answer was correct and supported, but the answer string contained either more or fewer characters than the exact answer.

Once the answers had been manually assessed, the following step in the evaluation was to give a set of numeric values summarising the performance of each system. These values were given with two purposes:

1. To compare the performance of different systems. Numeric scores permit not only judging which system is best, but also studying the same system with different configurations, by analysing the effect of including new features on system performance.

2. To predict the performance of a system in future real scenarios. One of the objectives of an evaluation such as that performed in QA at CLEF is to predict, in a controlled environment, the behavior that a system would have in the real world.

The scores assessing the performance of QA systems at CLEF were calculated using different evaluation measures, each of which was based on the information derived from the human assessments. As each measure is generally aimed at analysing only a specific aspect of the behavior of a system, one should be careful when drawing conclusions on the basis of a single metric, since it is probably appropriate only for assessing a particular system feature.

4.1 Evaluation measures applied

Several evaluation measures have been used in the QA at CLEF campaigns. In each competition a main measure was selected to rank the results of the participating systems, while several additional measures were adopted in order to provide more information about the systems' performances.

Mean Reciprocal Rank (MRR) was used in the first campaign (2003) as the main evaluation measure, while in the following years it was employed as a secondary measure whenever more than one answer per question was requested. MRR is related to the Average Precision used in Information Retrieval (Voorhees and Tice 1999) and was used at CLEF when systems had to return up to three answers per question ranked by confidence, putting the surest answer in first place. According to MRR, the score for each question is the reciprocal of the rank at which the first correct answer is given. That is, each question can be scored 1, 0.5, 0.333, or 0 (in the case where none of the three answers given is correct). The final evaluation score is the mean over all the questions. Thus, MRR makes it possible to evaluate systems giving more than one answer per question, rewarding the precision of systems that place correct answers in the first positions of the answer ranking.
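As an illustration of the definition above, the following Python sketch computes MRR from per-question lists of assessor judgments ordered by system confidence, using the R/W/U/X labels introduced earlier; it is a minimal reading of the measure, not the official CLEF scoring code.

```python
from typing import Dict, List

def mrr(judgments: Dict[str, List[str]]) -> float:
    """judgments maps a question id to its ranked answer judgments ('R', 'W', 'U', 'X')."""
    total = 0.0
    for ranked in judgments.values():
        # Reciprocal rank of the first Right answer; 0 if no ranked answer is Right.
        total += next((1.0 / (i + 1) for i, j in enumerate(ranked) if j == "R"), 0.0)
    return total / len(judgments) if judgments else 0.0

# Example: three questions with up to three ranked answers each.
print(mrr({"q1": ["W", "R", "W"], "q2": ["R"], "q3": ["W", "W", "W"]}))  # (0.5 + 1 + 0) / 3 = 0.5
```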

The most frequently used evaluation measure in QA at CLEF was accuracy, i.e. the proportion of questions correctly answered. When more than one answer per question was returned, accuracy took into consideration only the first answer. Accuracy rewards more precise behaviour than MRR, since it only takes into account one answer per question. This is why it was used as the main evaluation measure from 2004 to 2008 (inclusive), while it was kept as a secondary measure in 2009, when c@1 was introduced.

With c@1 (Peñas and Rodrigo 2011), systems can either respond to a question, or leave it unanswered if they are not confident of finding a correct answer. The main rationale behind c@1 is that, in some scenarios (for instance in medical diagnosis), leaving a question unanswered is preferable to giving an incorrect answer. In fact, c@1 rewards the ability of a system to maintain the number of correct answers while reducing the number of incorrect ones by leaving some questions unanswered. This is effectively a strategy of increasing precision while maintaining recall, an essential provision for any system which is expected to be employed by real users. The formulation of c@1 is given in (1), where n_R is the number of questions correctly answered, n_U is the number of unanswered questions, and n is the total number of questions:

c@1 = \frac{1}{n}\left(n_R + n_U \cdot \frac{n_R}{n}\right)    (1)

It must be noted that the concept of leaving a question unanswered is different from giving a NIL answer. In the former case, a system shows that it is not able to find a correct answer to the question, while in the latter the system's conclusion is that there is no correct answer to the question in the target collection.
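The following Python sketch contrasts accuracy and c@1 on the same set of judgments, with None standing for a question the system chose to leave unanswered; it is an illustrative reading of Eq. (1) under the simplifying assumption that only 'R' judgments count as correct, not the official evaluation script.

```python
from typing import List, Optional

def accuracy(judgments: List[Optional[str]]) -> float:
    # Unanswered questions (None) simply count as not correct.
    return sum(1 for j in judgments if j == "R") / len(judgments)

def c_at_1(judgments: List[Optional[str]]) -> float:
    # Eq. (1): c@1 = (1/n) * (nR + nU * nR / n)
    n = len(judgments)
    n_r = sum(1 for j in judgments if j == "R")
    n_u = sum(1 for j in judgments if j is None)
    return (n_r + n_u * n_r / n) / n

# Ten questions: 5 Right, 2 Wrong, 3 left unanswered.
judged = ["R", "R", "R", "R", "R", "W", "W", None, None, None]
print(accuracy(judged))  # 0.5
print(c_at_1(judged))    # 0.5 + 3 * 0.5 / 10 = 0.65
```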

The adoption of c@1 was a consequence of the Answer Validation Exercises4 (AVE) carried out as a subtask of QA at CLEF from 2006 to 2008 (Peñas et al. 2006, 2007; Rodrigo et al. 2008). In AVE, the development of Answer Validation technologies was sustained by the effort of improving the ability of QA systems to determine the correctness of their answers and, therefore, to reduce the number of incorrect answers. Because AVE showed that it was possible to improve QA results by including deeper analysis concerning the correctness of answers, it was decided to transfer this idea to the main task by using the c@1 measure.

The rest of the measures used in the QA@CLEF evaluations (always as secondary measures) were focused on evaluating systems' confidence in the correctness of their responses. Confidence Weighted Score (CWS) (Voorhees 2002), which had already been used for evaluating QA systems at TREC, could be applied when systems ordered their answers from the most confident response to the least confident one. CWS rewards a system more for a correct answer early in the ranking than for a correct answer later in the ranking. The formulation of CWS is given in (2), where n is the number of questions and C(i) (Eq. 3) is the number of correct answers up to position i in the ranking; I(j) is a function that returns 1 if answer j is correct and 0 otherwise. CWS gives more weight in the final score to some questions than to others. Specifically, questions whose correct answers are in the highest positions of the ranking contribute significantly to the final score, while questions with answers at the bottom of the ranking contribute much less.

CWS = \frac{1}{n} \sum_{i=1}^{n} \frac{C(i)}{i}    (2)

C(i) = \sum_{j=1}^{i} I(j)    (3)
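A small Python sketch of Eqs. (2) and (3) follows; it is again only an illustrative reading of the definition, where the input is the list of per-question correctness flags ordered from the system's most confident question to its least confident one.

```python
from typing import List

def cws(correct_by_confidence: List[bool]) -> float:
    """Confidence Weighted Score over questions ranked by the system's confidence."""
    n = len(correct_by_confidence)
    running_correct = 0  # C(i): number of correct answers up to rank i
    total = 0.0
    for i, is_correct in enumerate(correct_by_confidence, start=1):
        running_correct += 1 if is_correct else 0
        total += running_correct / i
    return total / n if n else 0.0

# Correct answers concentrated at the top of the ranking score higher...
print(cws([True, True, False, False]))   # (1/1 + 2/2 + 2/3 + 2/4) / 4 ≈ 0.79
# ...than the same number of correct answers at the bottom.
print(cws([False, False, True, True]))   # (0 + 0 + 1/3 + 2/4) / 4 ≈ 0.21
```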

4 http://nlp.uned.es/clef-qa/ave/.


Two other measures focused on the evaluation of systems' self-confidence, K and K1, were adopted in a pilot task at CLEF 2004 (Herrera et al. 2005). In order to apply K and K1, QA systems had to return a real number between 0 and 1 associated with each answer, indicating their confidence in the given answer. A value of 1 meant that the system was totally sure about the correctness of its answer, while 0 meant that the system did not have any evidence supporting the correctness of the answer.

K and K1 are based on a utility function that returns -1 if the answer is incorrect and 1 if it is correct. This positive or negative value is weighted by the normalised confidence self-score given by the system to each answer. The formulation of K1 is shown in (4), while the formulation of K (a variation of K1 for use where there is more than one answer per question) is shown in (5). In these formulas, R(i) is the total number of known answers to question i that are correct and distinct; answered(i) is the number of answers given by a system for question i; self_score(r) is the confidence score assigned by the system to answer r; and eval(r) depends on the judgement given by a human assessor:

• eval(r) = 1 if r is judged as correct

• eval(r) = 0 if r is a repeated answer

• eval(r) = -1 if r is judged as incorrect

K1 = \frac{\sum_{i \in \{correct\ answers\}} self\_score(i) \; - \; \sum_{i \in \{incorrect\ answers\}} self\_score(i)}{n}    (4)

K = \frac{1}{\#questions} \sum_{i \in questions} \frac{\sum_{r \in answers(i)} self\_score(r) \cdot eval(r)}{\max\{R(i),\, answered(i)\}}    (5)

K and K1 range between -1 and 1. However, the final value given by K and K1 is difficult to interpret: a positive value does not necessarily indicate more correct answers than incorrect ones, but rather that the sum of the scores of correct answers is higher than the sum of the scores of incorrect ones.
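Finally, here is a minimal Python sketch of K1 as given in Eq. (4), taking one self-scored answer per question together with its assessor judgment; as with the other snippets, it is an illustrative reconstruction under the assumption that only 'R' counts as correct, not the official scorer.

```python
from typing import List, Tuple

def k1(answers: List[Tuple[float, str]]) -> float:
    """answers: (self_score in [0, 1], judgment) pairs, one answer per question.
    Judgments follow the CLEF scheme; only 'R' is treated as correct here."""
    n = len(answers)
    correct_sum = sum(score for score, judgment in answers if judgment == "R")
    incorrect_sum = sum(score for score, judgment in answers if judgment != "R")
    return (correct_sum - incorrect_sum) / n if n else 0.0

# A confident correct answer helps; a confident incorrect one hurts just as much.
print(k1([(0.9, "R"), (0.8, "W"), (0.1, "W"), (0.6, "R")]))  # (0.9 + 0.6 - 0.8 - 0.1) / 4 = 0.15
```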

5 Discussion of results

The QA campaigns can be divided into three eras, as can be seen from Table 1. The division has been made considering the collections used and the type of questions:

• Era I: 2003–2006. Ungrouped, mainly factoid questions asked against monolingual newspapers; exact answers returned.

• Era II: 2007–2008. Grouped questions asked against newspapers and Wikipedias; exact answers returned.

• Era III: 2009–2010. Ungrouped questions against multilingual parallel-aligned EU legislative documents; passages or exact answers returned.


In considering results from different years at CLEF, we need to bear in mind the following points. Firstly, the task to be performed may differ from year to year; the task in a particular year may be easier or harder than that of the previous year, and this could result in a general level of performance which is respectively higher or lower. Secondly, in Eras I and II the document collections used were necessarily different for each target language. Naturally this affects results, though it does not invalidate general comparisons between languages. Thirdly, even if questions and documents are identical, as in Era III, there may be intrinsic differences between languages which preclude exact comparison (see further discussion later). Nevertheless, performance figures and comparisons between them give an important indication of the state of the field, the activity in different countries and the issues in language processing which need to be tackled—all important and substantive matters.

Table 7 summarises the results from 2003 to 2010 in terms of accuracy; these are given as the percent of questions which were answered correctly, to the nearest 1%. Since the task was quite different in each of the above eras, we need to consider the evaluation results separately.

In the first era (2003–2006), monolingual factoid QA showed a steady improvement, starting at 49% in the first year and increasing to 68% in the fourth (2006). Interestingly, the best system was for a different language in each of those years—Italian, Dutch, Portuguese and French respectively. The improvement can be accounted for by the adoption of increasingly sophisticated techniques gleaned from other monolingual tasks at TREC and NTCIR, as well as at CLEF. However, during the same period, cross-lingual QA showed very little improvement, remaining in the range 35–49%. The bottleneck for cross-lingual QA is Machine Translation, and clearly the required improvement in MT systems was not realised by participants in the task.

Table 7 Results at QA@CLEF based on accuracy

Year Monolingual Cross-lingual

Mean (%) Best Ans Mean (%) Best Ans

Era I

2003 29 49% IT Exact 17 45% IT-EN Exact

2004 24 46% NL Exact 15 35% EN-NL Exact

2005 29 65% PT Exact 18 40% EN-FR Exact

2006 28 68% FR Exact 25 49% PT-FR Exact

Era II

2007 23 54% FR Exact 11 42% EN-FR Exact

2008 24 64% PT Exact 13 19% RO-EN Exact

Era III

2009 41 61% EN Para 16 18% EU-EN Para

2010 51 72% EN Para 28 30% EN-RO Para

These are given as the percent of questions answered exactly right, to the nearest 1%. In 2003, three attempts were allowed at each question and if one of these was correct, the answer was "exactly right". For results in terms of the other measures—C@1 (2009–10), CWS (2004–8), K1 (2005–7) and MRR (2003, 2006 and 2008)—see Table 8.

As a general remark, systems which attempted a cross-language task in addition to a monolingual one did not show a similar performance trend in the two tasks, the cross-language task recording much lower scores. For example, the QRISTAL system developed by Synapse Développement in 2005 (Laurent et al. 2005) participated in four tasks having French as the target language—namely monolingual French, English–French, Italian–French, and Portuguese–French. While it obtained good results in the monolingual task, reaching 64%, its performance decreased in the cross-language tasks, scoring 39.50, 25.50 and 36.50% respectively. Another example is the 2006 Priberam system (Cassan et al. 2006): it performed well in the monolingual Portuguese task, with an accuracy of 69%, but in the cross-lingual Spanish–Portuguese task its accuracy dropped to 29%. Similarly, the system scored 51% in the monolingual Spanish task, but only 34.4% in the cross-lingual Portuguese–Spanish task.

In the second era (2007–2008), the task became considerably more difficult because questions were grouped around topics and, in particular, because it was sometimes necessary to use coreference information across different questions. Monolingual performance dropped 14 points, from its previous high of 68% in 2006 to 54% in 2007, and then increased to 64% in 2008. Once again the language was different in each year—first French and then Portuguese. At the same time, cross-lingual performance decreased from the 2006 figure of 49% (PT-FR) in the previous era to 42% (EN-FR) in 2007. Relative to the change in monolingual system performance, this was a smaller decrease. Then, in 2008, the figure fell to 19% (RO-EN). This dramatic change can be explained by the fact that the monolingual systems in Era II were the same as those in Era I, while the highest performing cross-lingual system of 2007 was from a particularly strong group which has consistently achieved very good results at TREC. Unfortunately this group chose not to participate in 2008.

In the third era (2009–2010), the task changed to one of paragraph retrieval while at the same time the questions and document collection became more difficult. Monolingual performance started at a similar level of 61% in 2009 and then rose to 72% in 2010. Cross-lingual performance was 18% (EU-EN) in 2009 and rose to 30% (EN-RO) in 2010. These very low figures can be accounted for by the fact that there was very little participation in the cross-lingual task during the third era.

Concerning monolingual performance taken over all 8 years, which language scores the highest? Generally, the language of the best system tended to change from year to year. Taken alphabetically, and considering the top scoring system for each year, we had one Dutch, two English, two French, one Italian and two Portuguese systems. A number of factors influence this, including the languages which were allowed in any particular year and the groups which were able to participate. Generally, however, we can conclude that very good systems were developed in a number of different languages, a key aim of CLEF in contrast to TREC. Concerning the language pairs of the best cross-lingual systems, they changed every year and the only pair which occurred twice was EN-FR. Most groups, therefore, appear to have developed some cross-lingual expertise through CLEF, though the performance of their systems is not necessarily very high.

5.1 Comparing results with different measures

The above discussion has been in terms of accuracy. However, as shown in the Sect.

4, several other measures have been used at CLEF, namely C@1, CWS, K1 and

MRR. Results in terms of these can be seen in Table 8. As we have seen, the latter

three measures all take into account a system’s confidence in its own answer, as

measured by a real number. We can also consider C@1 in the same way, because in

this case, a system is required to give a boolean confidence measure, either 0 or

100%. If the confidence is 0% then the answer is withheld and if it is 100% the

answer is returned.
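To make the behaviour of these measures concrete, the following short Python sketch (not the official CLEF scorer; the function names are ours) computes C@1 following the formulation of Penas and Rodrigo (2011) cited in the references, together with MRR over ranked answer lists.

# A minimal sketch of two of the measures discussed above: C@1 rewards leaving a
# question unanswered rather than answering it wrongly, while MRR scores ranked
# answer lists. This is an illustration, not the official CLEF scoring code.

def c_at_1(judgements):
    """judgements: one of 'R' (right), 'W' (wrong) or 'U' (unanswered) per question."""
    n = len(judgements)
    n_right = judgements.count('R')
    n_unanswered = judgements.count('U')
    # C@1 = (n_R + n_U * n_R / n) / n, as in Penas and Rodrigo (2011)
    return (n_right + n_unanswered * n_right / n) / n

def mrr(first_correct_ranks):
    """Rank of the first correct answer per question, or None if there is none."""
    return sum(1.0 / r for r in first_correct_ranks if r is not None) / len(first_correct_ranks)

# With no unanswered questions C@1 reduces to plain accuracy (0.5 here);
# withholding one answer instead of answering it wrongly raises the score.
print(c_at_1(['R', 'W', 'R', 'W']))   # 0.5
print(c_at_1(['R', 'U', 'R', 'W']))   # 0.625
print(mrr([1, 2, None, 1]))           # 0.625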

Results according to different metrics cannot be strictly compared. This is

because different measures reward different system behaviours. Nevertheless, it is interesting to consider whether the results given under these measures differ from those given by simple accuracy, as this may show that the best-scoring system

can be different depending on the evaluation measure used. In considering Table 8,

we should remember that not all runs in a particular year were judged using the

alternative measures, even when these measures were in force. One of the reasons

for this is that certain systems—sometimes very high scoring ones—are not

Table 8  Results at QA@CLEF based on C@1, CWS, K1 and MRR

                 Monolingual                        Cross-lingual
Year     Mean     Best    Ans                Mean     Best    Ans

C@1 scores
2009     0.42     0.68    RO, Para           0.17     0.18    EU-EN, Para
2010     0.54     0.73    EN, Para           0.32     0.36    EU-EN, Para

CWS scores
2004     0.135    0.333   DE, Exact          0.064    0.177   DE-EN, Exact
2005     0.153    0.385   DE, Exact          0.085    0.203   DE-EN, Exact
2006     0.247    0.557   FR, Exact          0.269    0.495   PT-FR, Exact
2007     0.064    0.175   ES, Exact          0.079    0.222   EN-FR, Exact
2008     0.085    0.342   NL, Exact          0.05     0.235   EN-NL, Exact

K1 scores
2005     -0.161   0.221   DE, Exact          -0.257   0.060   EN-DE, Exact
2006     -0.378   0.273   FR, Exact          -0.23    -0.179  EN-FR, Exact
2007     -0.261   0.043   IT, Exact          0.124    0.124   EN-FR, Exact

MRR scores
2003     0.326    0.422   IT, Exact*         0.215    0.322   IT-EN, Exact*
2006     0.319    0.679   FR, Exact          0.337    0.495   PT-FR, Exact
2008     0.199    0.448   ES, Exact          0.131    0.240   EN-DE, Exact


designed to return confidence scores, and without these values, some of these

measures cannot be computed.

Starting with C@1 (2009–2010, Era III) we can see that for monolingual QA the

best system was EN by both C@1 and accuracy in 2010 but that in 2009 it was RO

by C@1 and EN by accuracy. In fact, the same EN system in 2010 had the best

C@1 and the best accuracy. However, in 2009, the RO system with C@1 0.68 in

Table 8 had an accuracy of only 0.52 whereas the EN system with an accuracy of

0.61 in Table 6 only had a C@1 of 0.61. Concerning cross-lingual results, they

concurred in both 2009 and 2010 (EU-EN).

Turning to CWS during 2004–2008 (Era I 2004–2006 and Era II 2007–2008) the

results do not concur with those of accuracy except in 2006; the same FR system in

that year had the best CWS and the best accuracy. Similarly, cross-lingual results

only concurred in 2006 (PT–FR) and 2007 (EN–FR). In both cases it was in fact the

same system which had the best CWS and the best accuracy.

Concerning K1 during 2005–2007 (Era I 2005–2006 and Era II 2007) for

monolingual results these concurred with accuracy only in 2006 (as with CWS).

Once again the language was FR and it was the same French run. Cross-lingual

results only concurred in 2007 (EN–FR) where it was also the same EN-FR run.

Regarding MRR during 2003, 2006 and 2008 (Era I, Era I and Era II

respectively), the monolingual results concurred in 2003 (IT) and 2006 (FR) but not

in 2008. The best MRR in 2003 was the same Italian run which obtained the best

accuracy, and similarly for the French run in 2006. Cross-lingual results concurred

in 2003 (IT–EN) and in 2006 (PT–FR) but not in 2008.

Another question concerning different evaluation measures is, where there is an

increase in accuracy, is there a comparable increase in the measure? The answer for

C@1 appears to be yes—the trend for C@1 scores seems to be quite similar to that

for accuracy scores: monolingual C@1 (2009–2010) increased from 0.68 to 0.73 as

against an increase in accuracy from 0.61 to 0.72. Cross-lingual results went from

0.18 to 0.36 by both measures.

Turning to CWS and K1, however, the trend is not clear-cut, but since not all

systems returned these scores, it is hard to draw a firm conclusion.

Regarding changes in MRR scores, the monolingual trend for the years 2003,

2006 and 2008 was 0.422, 0.679, 0.448 while accuracy figures were 0.49, 0.68, 0.64.

However, the anomaly of 0.448 is accounted for because it was for a system with

accuracy 0.43. The more accurate systems did not return MRR scores. So, generally,

MRR appears similar to accuracy.

5.2 Comparing results across languages

The number of runs submitted in each language (monolingual) and language pair

(cross-lingual) across the three eras is shown in Table 5. As can be seen, the main

interest has always been in the monolingual systems, with the majority of teams

building a monolingual system in just their own language. Naturally, most groups

are also capable of building a good English monolingual system, but these have not

been allowed at CLEF except in Era III. However, cross-lingual runs from or to


English are allowed, and as the table shows, most of the runs between languages are

indeed either from English to the language of the team or the other way around. What

follows from this is that a relatively high number of cross-language tasks are activated

each year with a very small number of runs (often just one or two) being submitted for

each. This has led to some criticism of the QA track at CLEF, that there are too many

languages and language pairs involved and that results are therefore not comparable

between language pairs. We turn to this point next, but we should also note in passing

that Europe is a highly multilingual region with many more languages than are

represented here. It seems fitting therefore that CLEF should encourage the

development of systems in as many of these languages as possible.

If several systems perform the same task on the same language pair, direct

comparison is of course possible. However, as discussed above, the nature of CLEF

means that this is rarely possible. So, can performance on different tasks be

compared? Up until 2009 (i.e. in Eras I and II), each target language had its own

document collection and corresponding set of questions which were then back-

translated into the source languages. Thus all tasks of the form S–T (with a fixed

target language T) were answering the same questions (albeit in different source

languages S) against the same target collection in T. This made a measure of

comparison possible, mainly in the case where T was EN since this was a task which

was within the means of most groups through their familiarity with English.

In order to take this comparison further, a new strategy was adopted in 2009

whereby a parallel aligned collection (Acquis) was used, meaning that the questions

and document collection were exactly the same for all monolingual tasks as well as

all cross-lingual tasks.

Moreover, some interesting additional experiments were performed at UNED

(Perez-Iglesias et al. 2009). Firstly, the document collections in all the various target

languages were indexed by paragraph, using the same IR engine in each case. The

queries in each language were then input to the corresponding IR system, and the

top ranking paragraphs returned were used as ‘baseline’ answers—this was possible

because the task that year was paragraph selection, not exact answer selection.

Interestingly, many systems returned results which were worse than the baseline, a

situation which probably arose because UNED tuned the parameters in their system

very carefully. Something similar was observed at TREC-8 (Voorhees and Tice

1999) where the AT&T system using passage retrieval techniques performed well

against those using QA techniques.

In the second experiment, UNED compared the performance of the baseline

systems across languages. Because all languages were answering the same questions

on the same collection, this enabled them to estimate the intrinsic difficulty of the

language itself. By applying the resulting difficulty coefficients to the various

submitted runs, they were able to make more accurate comparisons between them.
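As a purely illustrative sketch of this idea (the baseline figures and the simple ratio normalisation below are invented assumptions, not the actual coefficients of Perez-Iglesias et al. 2009), a run can be rescaled by the IR baseline accuracy obtained for its language:

# A hedged sketch of the cross-language comparison described above: the IR
# baseline accuracy for each language is taken as a rough difficulty coefficient
# and every submitted run is rescaled by it. The figures are hypothetical.

baseline_accuracy = {"EN": 0.40, "RO": 0.30, "DE": 0.35}          # hypothetical IR baselines
submitted_runs = [("run-EN-1", "EN", 0.52), ("run-RO-1", "RO", 0.48)]

for name, lang, accuracy in submitted_runs:
    # 1.0 means "exactly as good as the shared IR baseline for that language"
    normalised = accuracy / baseline_accuracy[lang]
    print(f"{name}: raw={accuracy:.2f}  baseline-normalised={normalised:.2f}")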

6 Techniques used by participants

In this section we outline some of the developments in QA which have taken place

during the various campaigns. As has already been mentioned, QA in the sense being


discussed here started at TREC in 1999. It ran there for several years before commencing

at CLEF in 2003. It follows from this that most established groups had developed

sophisticated techniques in English before CLEF started. Moreover, most first-

generation QA systems evolved into a common architecture: Question analysis leading

to question type determination; document/passage selection; candidate answer extrac-

tion; and finally, answer selection. This architecture and a detailed discussion of QA can

be found in Hirschman and Gaizauskas (2001) and Prager (2006).

What follows relates specifically to QA at CLEF. For each year, the best three

monolingual systems, independent of language, were identified, as well as the single

best cross-lingual system. Descriptions of these systems were then studied in order

to observe general trends over the years of the evaluation exercise. We refer to a

group’s CLEF overview paper by the name of the group, the reference to the paper,

and the year of the campaign, e.g. Amsterdam; Jijkoun et al. (2003).

Concerning monolingual systems, the first observation is that one key to success

has been the use of vast numbers of hand-tuned patterns in order to achieve high

performance. Systems include Priberam and Synapse, both of which have

repeatedly achieved very high scores. For example, Priberam uses Question

Patterns to assign queries to categories (possibly more than one), Answer Patterns to

assign document sentences to various categories at indexing time, depending on

what kind of question they could answer, and Question Answering Patterns to

extract an answer to a question. This work is labour-intensive. Priberam spent

twelve person-months converting a lexicon and associated rules from Portuguese

into Spanish (Priberam; Cassan et al. 2006). Conversion of question analysis rules to

Spanish took a further 2 months. The use of detailed answer patterns is of course not

new and goes back to Hovy et al. (2001).

In addition, vast resources—often hand compiled—are often used. For example,

in 2006, Synapse (Laurent et al. 2006) reported a nominal dictionary of 100,000

entries and a multilingual list of 5,000 proper names in a number of different

languages, as well as the use of 200,000 translations of words or expressions. These

materials were specially refined, checked and further developed on a continuous

basis. Information of interest can include lists of persons, places etc. with additional

information (such as a person’s occupation or a place’s population), lexical data,

ontologies and so on. Sometimes semi-automatic methods are used for creating

these, such as extracting them from document collections, Wikipedia or the Web.

However, hand-correction and refinement is always the key to top performance.

A second theme has been the rise in importance of answer validation techniques.

Given a list of likely correct answers, the correct one must be chosen. In early

systems, the answer was frequently there but was somehow missed.

An early form of answer validation involved the use of the web, as pioneered by

Magnini et al. (2002) and widely used by other systems thereafter. Here, an association

is sought on the web between terms from the query and the candidate answer. If this is

found, it suggests that the answer is correct. Systems at CLEF using this technique

include Alicante (Vicedo et al. 2003, 2004), ITC-irst (Negri et al. 2003), and Evora

(Saias and Quaresma 2007, 2008) and it was particularly suitable during the years

when the questions were of the ‘‘trivia’’ factoid type since information concerning the

answers was readily found outside the official document collections.


Another form of answer validation which has been widely adopted is the

comparison of n-grams between query and candidate document (Indonesia, Toba

et al. 2010; Romanian Academy, Ion et al. 2009; UNED, Rodrigo et al. 2009). This

of course takes into account both word occurrence and word order, and is no doubt

inspired by the BLEU automatic evaluation measure for machine translation

systems (Papineni et al. 2002). Typically 1-grams or 2-grams are used though one

approach uses up to 5-grams (UNED, Rodrigo et al. 2009).
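A minimal sketch of this kind of n-gram overlap validation is given below; the tokenisation, the overlap ratio and the example strings are illustrative assumptions rather than any particular system's formula.

# A minimal sketch of BLEU-inspired n-gram answer validation: count how many
# question n-grams also occur in the supporting passage. Illustrative only.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(question, passage, max_n=2):
    """Fraction of question 1..max_n-grams that also appear in the passage."""
    q_tokens, p_tokens = question.lower().split(), passage.lower().split()
    matched = total = 0
    for n in range(1, max_n + 1):
        q_ngrams = ngrams(q_tokens, n)
        matched += len(q_ngrams & ngrams(p_tokens, n))
        total += len(q_ngrams)
    return matched / total if total else 0.0

print(overlap_score("who painted the Mona Lisa",
                    "the Mona Lisa was painted by Leonardo da Vinci"))   # ~0.67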

A final popular form of answer validation is textual entailment (measured by

logical deduction, WordNet word chaining etc.) to link a query term to a candidate

paragraph term (Indonesia, Toba et al. 2010; Romanian Academy, Ion et al. 2009).

If such a chain is found, it supports the hypothesis that the answer is correct.

From an architectural perspective, a major innovation has been the multi-stream

approach of Amsterdam (Amsterdam, Jijkoun et al. 2003, 2004). They introduce

redundancy into the structure of their system by dividing it into a number of

different streams, each of which independently searches for answer candidates.

These are merged at the end before selecting the system’s response. In 2003 there

were five streams but by 2004 this had risen to eight.

The idea behind ‘‘Early Answering’’ is to mine possible answers to questions

from the document collection and elsewhere, prior to answering any questions

(Clarke et al. 2001). Answers are saved in a series of databases and can then be used

to determine the answer to certain questions prior to searching the document

collection. At CLEF a successful use of this technique can be seen in the systems of

Amsterdam (Jijkoun et al. 2003, 2004) and Avignon (Gillard et al. 2006).
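The following toy sketch illustrates the general idea; the relation table, the single question pattern and the fallback behaviour are invented for illustration, since real systems mine such databases from the collection or the Web.

# A small sketch of "Early Answering": answers to anticipated question patterns
# are mined offline into a store, and a lookup is tried before any retrieval.

import re

# Offline-mined (relation, key) -> answer store (invented contents).
early_answers = {
    ("capital_of", "france"): "Paris",
    ("capital_of", "italy"): "Rome",
}

def early_answer(question):
    m = re.match(r"what is the capital of (\w+)\??", question.lower())
    if m:
        return early_answers.get(("capital_of", m.group(1)))
    return None   # fall back to the normal retrieval pipeline

print(early_answer("What is the capital of France?"))   # Paris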

CLEF has witnessed a number of interesting developments concerned with the

retrieval process. Firstly, there are different units which can be indexed—whole

documents (Avignon, Gillard et al. 2006; Cuza Romania, Iftene et al. 2009; Romanian

Academy, Ion et al. 2009), blocks of arbitrary fixed size such as 1 kB (Synapse,

Laurent et al. 2005), paragraphs (Cuza Romania, Iftene et al. 2009; Romanian

Academy, Ion et al. 2009; Valencia, Correa et al. 2009), passages of variable size

(Avignon, Gillard et al. 2006) or sentences (Alicante, Vicedo et al. 2003; Priberam,

Amaral et al. 2008). In some cases, several different indices are used simultaneously—

Synapse (Laurent et al. 2005) reports the use of eight. Conversely, Indonesia (Toba

et al. 2010) have three indices of the same text units indexed using different

algorithms. Comparison between the results forms an important part of their

successful retrieval component. Aside from the number of indices and the amount of

text indexed, there is the question of what to index by, other than keywords. Prager’s

landmark work on Predictive Annotation (indexing by NE type) (Prager et al. 2000)

has been hugely influential and related ideas can be seen at CLEF. For example

Priberam (Amaral et al. 2005, Cassan et al. 2006) use Answer Patterns to assign

document sentences to various categories, depending on what kind of question they

could answer. Similarly, Synapse (Laurent et al. 2005, 2006, 2007, 2008) index by

named entity types but also by question type, answer type and field of study (e.g.

aeronautics).
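A hedged sketch of this predictive-annotation style of indexing is shown below; the two regular expressions stand in for a real named-entity recogniser and the answer-type labels are illustrative assumptions.

# At indexing time each sentence is tagged with the answer types it could serve,
# so that, for example, a who-question only retrieves sentences containing a PERSON.

import re
from collections import defaultdict

def answer_types(sentence):
    types = set()
    if re.search(r"\b\d{4}\b", sentence):
        types.add("DATE")
    if re.search(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", sentence):
        types.add("PERSON")        # crude stand-in: two capitalised words
    return types

index = defaultdict(list)          # answer type -> sentences
for sent in ["Hernando Cortes arrived in 1519.",
             "The bay stretches along 50 miles of coast."]:
    for t in answer_types(sent):
        index[t].append(sent)

print(index["PERSON"])   # sentences that could answer a who-question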

Related to indexing is the broader issue of text representation, called the

‘‘Document Representative’’ by van Rijsbergen (1979). An information retrieval

system traditionally uses an inverted index based on keywords or phrases. The


developments mentioned above extend this with named entities, query types etc.

However, the index often remains the sole remnant of the document collection.

Nevertheless, there have been some interesting developments using representations of

parsed and disambiguated texts. These include deep parsing with WOCADI and

MULTINET (Hagen, Hartrumpf 2004), dependency parsing (Groningen, Bouma

et al. 2005), and constraint grammars together with semantic representations and

logical representations of both queries and text sentences (Evora, Saias and

Quaresma 2007). There are of course many problems with such approaches,

including structural and semantic disambiguation. However, accuracy in parsing is

improving all the time—Groningen (Bouma et al. 2005) report 88%. In addition,

there is the issue of efficiency. The Hagen (Hartrumpf 2004) group report a single-

CPU time of 5–6 months to parse the collection while Groningen (Bouma et al.

2005) mentions 17 months. In the latter case, they speed the process up dramatically

by dividing it up among processors in a Beowulf cluster.

Alongside the use of pre-defined patterns mentioned earlier, there is the use of

machine learning algorithms which of course is now widespread in all areas of

natural language processing, including QA. Examples at CLEF of tasks carried out

with machine learning include query type identification using decision trees

(Avignon, Gillard et al. 2006 and others), NE recognition (most systems), and

probable answer type recognition in a passage (Romanian Academy, Ion et al.

2009).

CLEF is multilingual, and this has opened the way for groups to experiment with

the use of cross-lingual redundancy to improve monolingual performance

(Amsterdam, Jijkoun et al. 2003, 2004; Alicante, Vicedo et al. 2004). One way in

which this can be done is to search for answers in different languages (for example,

Amsterdam when working in monolingual Dutch look in the English collection as

well) and if an answer is found (e.g. in English) this can then be searched for in the

Dutch collection. Alicante have a similar strategy.

Following on from cross-lingual redundancy is the task of cross-lingual QA

itself. The key problem here is the need for high-quality machine translation in

order not to introduce noise at the translation stage. This remains something of an

unsolved problem at CLEF—inspection of the CLEF monolingual vs. cross-lingual

results (see Sect. 5) shows that cross-lingual performance is still significantly below

monolingual performance. At CLEF, there have essentially been two approaches.

The first is the translation of words using predefined ontologies or dictionaries (ITC-

irst, Negri et al. 2003; Synapse, Laurent et al. 2005, 2006; Priberam, Cassan et al.

2006; Wolverhampton, Dornescu et al. 2008; Basque Country, Agirre et al. 2009,

2010). In many cases, resources are hand-tuned to optimise performance (see

Synapse in particular). They can also be derived semi-automatically, e.g. from

aligned Wikipedia pages (Wolverhampton, Dornescu et al. 2008). Words translated

individually in this way need to be disambiguated and validated with respect to the

target corpus. ITC-irst (Negri et al. 2003) present a way of doing this by generating

all possible combinations of candidate translations for all the words in the query and

then searching for these in the document collection. The most frequently co-

occurring combination of word senses found in the collection, for the maximal

number of words in the query, is chosen as being the most likely. A similar approach


is taken by the Basque Country (Agirre et al. 2009, 2010). Priberam (Cassan et al.

2006) tackle disambiguation by using the EuroParl parallel corpus to determine

likely translations for words. These can then be used to score candidate translations

created using their dictionaries and ontologies.
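The combination-search idea attributed to ITC-irst above can be sketched as follows; the toy dictionary, the two-document collection and the document-count score are illustrative assumptions, not the system's actual resources.

# Generate every combination of candidate translations for the query words and
# keep the combination whose words co-occur most often in the target collection.

from itertools import product

candidates = {"banco": ["bank", "bench"], "rio": ["river", "laugh"]}   # toy dictionary
collection = ["the bank of the river was flooded",
              "he sat on a bench in the park"]

def cooccurrence(words):
    # number of documents containing all the translated words
    return sum(all(w in doc.split() for w in words) for doc in collection)

best = max(product(*candidates.values()), key=cooccurrence)
print(best)    # ('bank', 'river')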

The Basque Country (Agirre et al. 2009, 2010) also adopt an interesting strategy

using true cognates between languages: they take a word in Basque, convert it to

several alternative possible spellings of the equivalent word in English, and then

search for these in the collection. Any sufficiently accurate match (by a string

distance measure) is considered a likely translation. This is a very useful technique

where parallel corpora and translation dictionaries are not available for a particular

language pair.
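A small sketch of this cognate strategy, with invented substitution rules and vocabulary and a standard string-similarity cutoff standing in for the system's own distance measure, might look as follows:

# Generate spelling variants of a source-language word and accept any collection
# term that is sufficiently similar. Rules and vocabulary are toy assumptions.

import difflib

def spelling_variants(word):
    # toy Basque -> English orthographic substitutions
    return {word, word.replace("k", "c"), word.replace("tx", "ch")}

def cognate_matches(word, vocabulary, cutoff=0.75):
    matches = set()
    for variant in spelling_variants(word):
        matches.update(difflib.get_close_matches(variant, vocabulary, cutoff=cutoff))
    return matches

print(cognate_matches("teknologia", ["technology", "geology", "biology"]))   # {'technology'}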

Finally, there is the use of English as a pivot language, as adopted by Synapse

(Laurent et al. 2005). To translate a word between Portuguese and French, they first

translate to English and then from English to French. This avoids the need for bi-

lingual dictionaries in all combinations of languages.

The second approach to translation within QA is the use of machine translation at

the sentence level, e.g. to translate the entire query and then process it

monolingually in the target language (Amsterdam, Jijkoun et al. 2004; Language

Computer Corp., Bowden et al. 2007). Language Computer Corp. are the most

spectacular example of this, since they, alone among CLEF groups, translate the

entire document collection into English and then proceed with cross-lingual QA in

an entirely monolingual fashion, still yielding excellent results.

7 Research context

In this section we address some of the relevant research directions on QA which

have been explored in recent years outside the scope of QA at CLEF, although they are connected with the CLEF topics and often influenced by the CLEF achievements. The purpose is to highlight how the QA at CLEF initiative has had a significant impact on the global QA research context. Among the numerous research directions, we

focus on Interactive QA and on QA over structured data.

7.1 Interactive QA

The ability of a QA system to interact with the user is crucial in order to realize

successful applications in real scenarios (Webb and Webber 2009). Providing the

explanation for an answer, managing follow up questions, providing justification in

case of failure, and asking clarifications can be considered as steps forward with

respect to fact-based Question Answering. Interactive QA has been addressed, both

at CLEF and TREC, respectively in the iCLEF and the ciQA tracks.

In iCLEF 2005 (Gonzalo et al. 2005), the interactive CLEF track focused on the

problem of Cross-Language Question Answering (CL-QA) from a user-inclusive

perspective. The challenge was twofold: (i) from the point of view of Cross-

Language QA as a user task, the question was how well systems help users locate


and identify answers to a question in a foreign-language document collection; (ii)

from the point of view of QA as a machine task, the question was how well

interaction with the user helps a Cross-Language QA system retrieve better answers.

In other words, the ultimate issue was to determine how the QA system can best

interact with the user to obtain details about a question that facilitate the automatic

search for an answer in the document collection. For instance, in case of ambiguity,

the system may request additional information from the user, avoiding incorrect

translations (for translation ambiguity) or incorrect inferences (for semantic

ambiguity).

At TREC 2006, the ciQA (Kelly and Lin 2007) task (complex, interactive

Question Answering) focused both on complex information needs and interactivity

in the context of Intelligence Analytics. To these purposes topics were composed by

both a template, which provided the question in a canonical form, and a narrative,

which elaborated on what the user was looking for, provided additional context, etc.

In the template, items in brackets represented ‘‘slots’’ whose instantiation varies

from topic to topic. For example:

Template: What evidence is there for transport of [drugs] from [Bonaire] to [the United States]?

Narrative: The analyst would like to know of efforts made to discourage narcotraffickers from using Bonaire as a transit point for drugs to the United States. Specifically, the analyst would like to know of any efforts by local authorities as well as the international community.

As for interactivity, participants had the opportunity to deploy a fully-functional

Web-based QA system for evaluation. For each topic, a human assessor could spend five minutes interacting with each system.

Interactivity has been further explored under the perspective of the background

knowledge that the system needs in order to provide both rich and natural answers

with respect to a given question, as well as clear explanations for failures. Magnini et al. (2009) argue that such abilities are necessarily based on a deep

analysis of the content of both question and answer, and an ontology-based

approach is proposed to represent the structure of a question–answer pair in the

context of utterance. This work focuses on aspects relevant to interactivity in a

general QA setting, including the ability to (i) consider the context of utterance of a

question, such as time and location; (ii) provide rich answers containing additional

information (e.g. justifications) with respect to the exact answer; and (iii) explain

failures when no answer is found.

7.2 QA over structured data

The explosion of data available on the Web in a structured format (e.g. DBPedia,

Freebase) has fostered the interest of the research community toward Question

Answering systems able to provide answers from such data, and to perform

reasoning on large knowledge bases. This perspective is even more interesting as

data on the Web are now being ''linked'' (linked data) through the use of

standard formats for exposing, sharing, and connecting pieces of data, information,


and knowledge on the Semantic Web using URIs and RDF. On the one hand, QA over structured linked data opens new perspectives, as the available data can potentially cover open-domain knowledge; on the other hand, most of the methodologies already tested (for instance at QA at CLEF) for the interpretation of questions can be reused and further extended within the new

scenario.

There are several research challenges relevant to QA over structured data,

including: (i) as many Semantic Web applications refer to one specific domain, it is crucial to develop techniques which can be easily ported from one domain to another; (ii) reasoning over large amounts of data requires efficient algorithms as well

as the ability to merge content from different sources; (iii) question interpretation

can take advantage of the temporal and spatial context of the utterance, in order to

provide more exact answers.

Among the initiatives on QA over structured data we mention QALL-ME, a

system which takes advantage of an ontology-based representation to provide precise

and rich answers, and PowerAqua, a system tailored to manage QA over large

knowledge bases.

The QALL-ME system (Ferrandez et al. 2011) is the outcome of a European

project whose goal has been the application of open domain methodologies for

question interpretation in the context of QA applications over data represented in an

ontology. The project has realized a shared infrastructure for multilingual and

multimodal Question Answering over structured data, which has been implemented and evaluated as a mobile phone application, and which is available as open source software. A relevant feature of the system is Context

Awareness, according to which all questions are anchored to a certain space and

time, meaning that every question always has a spatial and temporal context. For

instance, using deictic expressions such as ‘‘here’’ or ‘‘tomorrow’’, a question posed

at eight o’clock in Berlin may potentially mean something completely different than

the same question posed at five o’clock in Amsterdam. Deictic expressions are

solved by algorithms which recognize temporal and spatial expressions in the

question and anchor relative expressions (e.g. ‘‘during the weekend’’, ‘‘the nearest’’)

to absolute expressions (e.g. ‘‘May, 22nd’’ ‘‘Unter den Linden, Berlin’’). In addition,

users may either explicitly indicate the spatial–temporal context in the question (e.g.

''Which movies are on tomorrow in Trento?'') or leave the context implicit, in which case it will be supplied by the system by means of default information (e.g. ''Which

movies are on’’ would be interpreted using ‘‘today’’ and the name of the town where

the question is uttered).
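As a toy illustration of this kind of context anchoring (the expression list and the weekend rule are assumptions, not the QALL-ME implementation), deictic temporal expressions can be resolved against the utterance date:

# Anchor deictic temporal expressions in a question to the time it is uttered.

from datetime import date, timedelta

def anchor(expression, utterance_date):
    if expression == "today":
        return utterance_date
    if expression == "tomorrow":
        return utterance_date + timedelta(days=1)
    if expression == "during the weekend":
        # next Saturday relative to the utterance date (toy rule)
        return utterance_date + timedelta(days=(5 - utterance_date.weekday()) % 7)
    raise ValueError(f"unknown deictic expression: {expression}")

print(anchor("tomorrow", date(2010, 5, 21)))            # 2010-05-22
print(anchor("during the weekend", date(2010, 5, 21)))  # 2010-05-22 (a Saturday)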

PowerAqua (Lopez et al. 2009) takes as input a natural language query and

returns answers drawn from relevant semantic sources from anywhere on the

Semantic Web. The crucial steps performed by the system are (i) the identification

of the ontologies relevant to the input question; (ii) the disambiguation of the terms

in the question against the concepts of the ontology in order to avoid potentially

incoherent constructions; and (iii) the mapping of the question into an appropriate

query based on the conceptual schema of the ontologies. What makes PowerAqua interesting with respect to QA at CLEF is that the system makes extensive use of

resources (e.g. WordNet) and techniques (e.g. disambiguation, similarity) which


have been extensively tested during the CLEF evaluations, as reported in Sect. 6,

highlighting its impact outside the CLEF community.

8 Conclusions

Prior to QA at CLEF, almost all QA was in English. Since the task was started in

2003, numerous groups have participated and experiments have been conducted in

many different language pairs. The result is that there are now several QA research

groups in almost all the European countries and they have sufficient expertise to

create systems which can perform complex tasks including difficult types of

questions—e.g. opinion and reason questions. In addition to numerous research

innovations within systems themselves, there have also been steps forward in the

evaluation methodology. These have included the use of several new evaluation

measures, the progress towards comparison of systems in different languages, and

the development of sophisticated tools for the data preparation.

Over the years, different trends have characterised the evolution of the QA task at

CLEF. In the first era, the emphasis was on developing basic QA techniques and

adopting them in different target languages. In the second, focus was placed on the

problems of linking questions together. These challenges included the detection of

the topic and the resolution of co-references within the sequence. In the third, focus

was switched to the direct comparison of systems between languages, a goal

enabled by the adoption of the fully parallel paragraph-aligned Acquis collection of

EU legislative documents.

Moreover, while at the beginning the aim of the exercise was to assess systems’

ability to extract an exact answer, over the years the importance of also providing a

context supporting the correctness of the answer became more and more evident.

For this reason, short text snippets were at first made mandatory in order to support the response; then, entire paragraphs replaced exact answers as the required system output. Returning a complete paragraph instead of an exact answer also allowed the

comparison between pure IR approaches and current QA technologies.

An additional factor which prompted the advancement of the task was the

increasing awareness of the necessity to consider potential users. This need was

addressed in ResPubliQA which was set in the legal domain and aimed at meeting

the requirements of anyone wanting to make inquiries about European legislation,

which could include lawyers, government agencies, politicians and also ordinary

citizens.

Another important output has been the multilingual test sets and their associated

gold standard answers and document collections. These are made possible by the

ingenious paradigm of back-translation which was introduced in 2003 and has been

very successfully used at CLEF ever since. Moreover, all this material is available

online allowing groups in future to re-use the data produced in order to develop and

tune their systems.5

5 Currently all the material is available at http://celct.isti.cnr.it/ResPubliQA/index.php?page=Pages/pastCampaigns.php.


Finally, what can be concluded from the results of the QA task itself? Generally,

English factoid QA as investigated at TREC over the years is no longer worth

studying. There is enough data available for developers. Following the activity at

CLEF, performance of monolingual non-English systems has improved substan-

tially, to the extent that they are approaching that of the best English systems. Now

is the time, therefore, to look at different types of question and different task

scenarios, a process which started with ResPubliQA6 in 2009–2010.

Concerning cross-lingual systems, their performance has not shown a comparable

improvement over the years to that of monolingual ones because high-performance

machine translation remains an unsolved problem, especially where named entities

are concerned (e.g. ‘Sur les quais’ translates as ‘On the Waterfront’). Thus

translation in the QA domain warrants further investigation if multilingual barriers

to text processing are to be overcome.

In 2011, QA at CLEF entered a new era with a completely different task:

Question Answering for Machine Reading Evaluation. Noticing that traditional QA

system architectures do not permit results far beyond 60% accuracy, we understood

that any real change in the architecture requires the prior development of answer

validation/selection technologies. For this reason, the new formulation of the task

after 2010 leaves the step of retrieval aside, to focus on the development of

technologies able to work with a single document, answering questions about it and

using the reference collections as sources of background knowledge that help the

answering process.

Acknowledgments The QA evaluation campaigns at CLEF are a joint effort involving many institutions and numerous people who have collaborated on the creation of the data set in the various languages involved each year, and have undertaken the evaluation of the results; our appreciation and thanks (in alphabetical order) goes to: Christelle Ayache, Inaki Alegria, Lili Aunimo, Maarten de Rijke, Gregory Erbach, Corina Forascu, Valentin Jijkoun, Nicolas Moreau, Cristina Mota, Petya Osenova, Victor Peinado, Prokopis Prokopidis, Paulo Rocha, Bogdan Sacaleanu, Diana Santos, Kiril Simov, Erik Tjong Kim Sang, Alessandro Vallin and Felisa Verdejo. Also, support for the TrebleCLEF Coordination, within FP7 of the European Commission, Theme ICT-1-4-1 Digital Libraries and Technology Enhanced Learning (Contract 215231), must be acknowledged for its fund for ground truth creation. This work has been partially supported by the Research Network MA2VICMR (S2009/TIC-1542) and the Holopedia project (TIN2010-21128-C02).

Appendix: Examples of question types

Ex. 1.1: Location
Q: Where did the Purussaurus live before becoming extinct?

A: in South America’s Amazon Basin

Context (LA081694):

Its name is Purussaurus and it is the largest species of crocodile that ever lived, a

massive-jawed creature that thundered through South America’s Amazon Basin 6

to 8 million years ago before its kind perished.

6 http://celct.isti.cnr.it/ResPubliQA.


Ex. 1.2: Measure
Q: How long is the coastline of Santa Monica Bay?

A: 50 miles

Context (LA120994):

On Thursday, Wilson endorsed the Santa Monica Bay Restoration Project’s action

plan ‘‘without condition,’’ calling it a vital, ambitious and broad approach to reduce

pollution and restore the ecology of the bay, which stretches along 50 miles of Los

Angeles County coast.

Ex. 1.3: Object
Q: What does magma consist of?

A: molten rock

Context (LA112794):

The odorless gas, more dense than air and collecting inside the floorless cabin, was

welling up from a recent intrusion of magma, or molten rock, miles below the

surface of this mountain that is popular with winter skiers and summer cyclists.

Ex. 1.3: Organisation
Q: What museum is directed by Henry Hopkins?

A: UCLA/Hammer Museum

Context (LA062394):

The story quotes UCLA/Hammer Museum Director Henry Hopkins as saying the

Codex is among works in the Hammer collection being held in escrow as a

guarantee against any financial obligation that might result from a lawsuit filed in

1990 on behalf of Joan Weiss, the niece and sole heir to the fortune of Hammer’s

wife, Frances, who died in 1989.

Ex. 1.4: Person
Q: Who is John J. Famalaro accused of having killed?

A: Denise Huber

Context (LA082094):

Arizona’s governor signed extradition documents Friday that could force house

painter John J. Famalaro to return to Orange County to face charges… Famalaro is

accused of murdering 23-year-old Denise Huber, who vanished after her car broke

down on the Corona del Mar (73) Freeway in June, 1991.

Ex. 1.5: Time
Q: When did Hernando Cortes arrive in the Aztec Empire?

A: in 1519

Context (LA112794):

When Spanish conquistador Hernando Cortes arrived in 1519, the Aztecs welcomed

him, believing he was their returning god.

Ex. 1.6: Other
Q: What is another name for the ''mad cow disease''?

A: bovine spongiform encephalopathy


Context (LA091194):

The government has banned foods containing intestine or thymus from calves

because a new scientific study suggested that they might be contaminated with the

infectious agent of bovine spongiform encephalopathy, commonly called ‘‘mad

cow disease.’’

Ex. 2: Definition
Q: What is Amnesty International?

A: human rights group

Context (GH951214):

The human rights group Amnesty International called Wei’s trial ‘‘a mockery of

justice’’.

Ex. 3: Manner
Q: How did Jimi Hendrix die?

A: drug overdose

Context (LA030994):

Hendrix Death Investigation Over: The British government said Tuesday that it

would not hold a new inquest into the death 24 years ago of rock legend Jimi

Hendrix, who died of a drug overdose at age 27.

Ex. 4: Temporal restriction by event
Q: Who was Uganda's President during Rwanda's war?

A: Yoweri Museveni

Context LA072894:

‘‘The complicity of Ugandan President Yoweri Museveni should not be overlooked

in the Rwandan crisis,’’ he adds.

Ex. 5: Temporal restriction by date
Q: Which city hosted the Olympic Games in 1992?

A: Barcelona

Context (LA082194):

But after the 1992 Barcelona Olympics, Henrich called Fong and announced, ‘‘Al,

I’m really jazzed. I want to train for ‘96. I know I can do it.’’

Ex. 6: Temporal restriction by time interval
Q: By how much did Japanese car exports fall between 1993 and 1994?

A: 18.3%

Context (LA053194):

Japan’s vehicle exports in 1993–94 fell 18.3% to 4.62 million, the second straight

year of decline.

Ex. 7: Closed list question
Q: Name the three Beatles that are alive

A: Paul McCartney, George Harrison and Ringo Starr

Context (LA012994-0011):


Paul McCartney, George Harrison and Ringo Starr—the three surviving

Beatles—are scheduled to reunite next month to record new music for a 10-hour

video documentary, also titled ‘‘The Beatles Anthology.’’

Ex. 8: Open list question
Q: What countries are members of the Gulf Cooperation Council?

A: Saudi Arabia, Kuwait, United Arab Emirates, Qatar, Oman and Bahrain

Context (LA101394-0380):

The Gulf Cooperation Council—whose members are Saudi Arabia, Kuwait, United Arab Emirates, Qatar, Oman and Bahrain—said in a formal communique

that the allied military buildup ‘‘should continue until they are sure that Iraq no

longer poses a threat.’’

Ex. 9: Grouped questions on topic ''Gulf War Syndrome''
Q1: What is Gulf War Syndrome?

A1: an inexplicable, untreatable collection of afflictions that reportedly have

touched thousands who fought in the desert

Context (LA111494):

Their short, tragic lives—chronicled neatly by their mothers in family photo

albums—are raising new fears that the mysterious Gulf War syndrome, an inexplicable, untreatable collection of afflictions that reportedly have touched thousands who fought in the desert, is now being passed on to the next generation.

Q2: How many people have been affected by it?

A2: 11,000

Context (LA121494):

Physicians said the 1,019 cases reflected in Tuesday’s report represent a significant

sample of veterans who have sought treatment for Gulf War syndrome. Of the

697,000 who served in Operation Desert Storm, 11,000 have complained of illness.

Some 8,000 are being processed.

Ex. 10: Procedure
Q: How do you find the maximum speed of a vehicle?

Paragraph (jrc31995L0001-en, para 125):

The maximum speed of the vehicle is expressed in km/h by the figure corresponding

to the closest whole number to the arithmetical mean of the values for the speeds

measured during the two consecutive tests, which must not diverge by more than

3%. When this arithmetical mean lies exactly between two whole numbers it is

rounded up to the next highest number.

Ex. 11: Purpose
Q: What is the aim of the Kyoto Protocol in relation to greenhouse gas emissions

for 2008-2012?

Paragraph (jrc32001Y0203_02-en, para 168):

The Kyoto Protocol, signed by the Member States and by the Community, provides

that the parties undertake to limit or reduce greenhouse gas emissions during the


period 2008-2012. For the Community as a whole, the target is to reduce greenhouse

gas emissions by 8% of their 1990 level.

Ex. 12: Reason
Q: Why are court decisions in Kazakhstan not made public?

Paragraph (jrc21994A1231_52-en, para 1330):

The process of establishing enquiry points has begun. As far as the judicial decisions

and administrative rulings are concerned they are not published in Kazakhstan

(except for some decisions made by the Supreme Court), because they are not

considered to be sources of law. To change the existing practice will require a long

transitional period.

Ex. 13: Other
Q: During a vehicle driver's rest period, are they entitled to a bunk?

Paragraph (jrc32006R0561-en, para 139):

1. By way of derogation from Article 8, where a driver accompanies a vehicle which

is transported by ferry or train, and takes a regular daily rest period, that period may

be interrupted not more than twice by other activities not exceeding 1 h in total.

During that regular daily rest period the driver shall have access to a bunk or

couchette.

Ex. 14: Opinion
Q: What did the Council think about the terrorist attacks on London?

Paragraph (jrc32006L0024-en, para 20):

On 13 July 2005, the Council reaffirmed in its declaration condemning the terrorist

attacks on London the need to adopt common measures on the retention of

telecommunications data as soon as possible.

References

Agirre, E., Ansa, O., Arregi, X., Lopez de Lacalle, M., Otegi, A., Saralegi, Z., et al. (2009). ElhuyarIXA:

Semantic relatedness and crosslingual passage retrieval. In C. Peters, G. di Nunzio, M. Kurimo, Th.

Mandl, D. Mostefa, A. Penas, & G. Roda (Eds.), Multilingual information access evaluation Vol. I: Text retrieval experiments, workshop of the cross-language evaluation forum, CLEF 2009, Corfu,

Greece, 30 September–2 October. Lecture notes in computer science 6241. Springer (Revised

selected papers).

Agirre, E., Ansa, O., Arregi, X., Lopez de Lacalle, M., Otegi, A., & Saralegi, X. (2010). Document

expansion for cross-lingual passage retrieval. In M. Braschler, D. Harman, & E. Pianta (Eds.),

Notebook papers of CLEF 2010 LABs and workshops, September 22–23, 2010 Padua, Italy.

Amaral, C., Figueira, H., Martins, A., Mendes, A., Mendes, P., & Pinto, C. (2005). Priberam’s question

answering system for Portuguese. In C. Peters, F. C. Gey, J. Gonzalo, H. Muller, G. J. F. Jones, M.

Kluck, B. Magnini, & M. de Rijke (Eds.), Accessing multilingual information repositories, 6th workshop of the cross-language evaluation forum, CLEF 2005, Vienna, Austria, September 21–23,

2005 (Revised selected papers).

Amaral, C., Cassan, A., Figueira, H., Martins, A., Mendes, A., Mendes, P., et al. (2007). Priberam’s

question answering system in QA@CLEF 2007. In C. Peters, V. Jijkoun, T. Mandl, H. Muller, D.

W. Oard, A. Penas, & D. Santos (Eds.), Advances in multilingual and multimodal information retrieval, 8th workshop of the cross-language evaluation forum, CLEF 2007, Budapest, Hungary,

September 19–21, 2007 (Revised selected papers).


Amaral, C., Cassan, A., Figueira, H., Martins, A., Mendes, A., Mendes, P., et al. (2008). Priberam’s

question answering system in QA@CLEF 2008. In C. Peters, T. Mandl, V. Petras, A. Penas, H.

Muller, D. Oard, V. Jijkoun, & D. Santos (Eds.), Evaluating systems for multilingual and multimodal information access, 9th workshop of the cross-language evaluation forum, CLEF 2008,

Aarhus, Denmark, September 17–19, 2008 (Revised selected papers).

Bouma, G., Mur, J., van Noord, G., van der Plas, L., & Tiedemann, J. (2005). Question answering for

Dutch using dependency relations. In C. Peters, F. C. Gey, J. Gonzalo, H. Muller, G. J. F. Jones, M.

Kluck, B. Magnini, & M. de Rijke (Eds.), Accessing multilingual information repositories, 6th workshop of the cross-language evaluation forum, CLEF 2005, Vienna, Austria, September 21–23,

2005 (Revised selected papers).

Bowden, M., Olteanu, M., Suriyentrakorn, P., d’Silva, T., & Moldovan. D. (2007). Multilingual question

answering through intermediate translation: LCC’s PowerAnswer at QA@CLEF 2007. In C. Peters,

V. Jijkoun, T. Mandl, H. Muller, D.W. Oard, A. Penas, & D. Santos (Eds.), Advances in multilingual and multimodal information retrieval, 8th workshop of the cross-language evaluation forum, CLEF

2007, Budapest, Hungary, September 19–21, 2007 (Revised selected papers).

Cassan, A., Figueira, H., Martins, A., Mendes, A., Mendes, P., Pinto, C., et al. (2006). Priberam’s

question answering system in a cross-language environment. In C. Peters, P. Clough, F. C. Gey, J.

Karlgren, B. Magnini, D. W. Oard, M. de Rijke, & M. Stempfhuber (Eds.), Evaluation of multilingual and multi-modal information retrieval, 7th workshop of the cross-language evaluation

forum, CLEF 2006, Alicante, Spain, September 20–22, 2006 (Revised selected papers).

Clarke, C., Cormack, G., & Lynam, T. (2001). Exploiting redundancy in question answering. In

Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR-2001). September 9–13, 2001, New Orleans, Louisiana. ACM 2001,

ISBN 1-58113-331-6.

Correa, S., Buscaldi, D., & Rosso, P. (2009). NLEL-MAAT at CLEF-ResPubliQA. In C. Peters, G. di

Nunzio, M. Kurimo, Th. Mandl, D. Mostefa, A. Penas, & G. Roda (Eds.), Multilingual information access evaluation Vol. I: Text retrieval experiments, workshop of the cross-language evaluation

forum, CLEF 2009, Corfu, Greece, September 30–October 2, 2010. Lecture notes in computer

science 6241. Springer (Revised selected papers).

Dornescu, I., Puscasu, G., & Orsasan, C. (2008). University of Wolverhampton at CLEF 2008. In C.

Peters, T. Mandl, V. Petras, A. Penas, H. Muller, D. Oard, V. Jijkoun, & D. Santos (Eds.),

Evaluating systems for multilingual and multimodal information access, 9th workshop of the cross-

language evaluation forum, CLEF 2008, Aarhus, Denmark, September 17–19, 2008 (Revised

selected papers).

Ferrandez, O., Spurk, C., Kouylekov, M., Dornescu, I., Ferrandez, S., Negri, M., Izquierdo, R., Tomas,

D., Orasan, C., Neumann, G., Magnini B., & Vicedo, J. L. (2011). The QALL-ME framework: A

specifiable-domain multilingual question answering architecture. Web semantics: Science, services and agents on the world wide web, 9(2), Provenance in the Semantic Web, 137–145.

Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A. A., et al. (2010). Building

Watson: An overview of the DeepQA project. AI Magazine, 31(3), 59–79.

Forner, P., Penas, A., Alegria, I., Forascu, C., Moreau, N., Osenova, P., et al. (2008). Overview of the

CLEF 2008 multilingual question answering track. In C. Peters, T. Mandl, V. Petras, A. Penas, H.

Muller, D. Oard, V. Jijkoun, & D. Santos (Eds.), Evaluating systems for multilingual and multimodal information access, 9th workshop of the cross-language evaluation forum, CLEF 2008,

Aarhus, Denmark, September 17–19, 2008 (Revised selected papers).

Giampiccolo, D., Forner, P., Herrera, J., Penas, A., Ayache, C., Forascu, C., et al. (2007). Overview of the

CLEF 2007 Multilingual Question Answering Track. In C. Peters, V. Jijkoun, T. Mandl, H. Muller,

D.W. Oard, A. Penas, & D. Santos (Eds.), Advances in multilingual and multimodal information retrieval, 8th workshop of the cross-language evaluation forum, CLEF 2007, Budapest, Hungary,

September 19–21, 2007 (Revised selected papers).

Gillard, L., Sitbon, L., Blaudez, E., Bellot, P., & El-Beze, M. (2006). The LIA at QA@CLEF-2006. In C.

Peters, P. Clough, F. C. Gey, J. Karlgren, B. Magnini, D. W. Oard, M. de Rijke, & M. Stempfhuber

(Eds.), Evaluation of multilingual and multi-modal information retrieval, 7th workshop of the cross-

language evaluation forum, CLEF 2006, Alicante, Spain, September 20–22, 2006 (Revised selected

papers).

Gonzalo, J., Clough, P., & Vallin, A. (2005). Overview of the CLEF 2005 interactive track. In C. Peters,

F. C. Gey, J. Gonzalo, H. Muller, G. J. F. Jones, M. Kluck, B. Magnini, & M. de Rijke (Eds.),


Accessing multilingual information repositories, 6th workshop of the cross-language evaluation forum, CLEF 2005, Vienna, Austria, September 21–23, 2005 (Revised selected papers).

Hartrumpf, S. (2004). Question answering using sentence parsing and semantic network matching. In C.

Peters, P. Clough, J. Gonzalo, G. J. F. Jones, M. Kluck, & B. Magnini (Eds.), Multilingual information access for text, speech and images, 5th workshop of the cross-language evaluation

forum, CLEF 2004, Bath, September 15–17, 2004, (Revised selected papers).

Herrera, J., Penas, A., & Verdejo, F. (2005). Question answering pilot task at CLEF 2004. Multilingual information access for text, speech and images. CLEF 2004. Vol. 3491 of lecture notes in computer

science, pp. 581–590.

Hirschman, L., & Gaizauskas. R. (2001). Natural language question answering: the view from here. In

natural language engineering, Vol. 7(4), December 2001, pp. 275–300.

Hovy, E., Gerber, L., Hermjakob, H., Lin, C., & Ravichandran, D. (2001). Toward semantics-based

answer pinpointing. In Proceedings of the DARPA human language technology conference (HLT), San Diego, CA.

Iftene, A., Trandabat, D., Pistol, I., Moruz, A. M., Husarciuc, M., Sterpu, M., & Turliuc, C. (2009).

Question answering on english and romanian languages. In C. Peters, G. di Nunzio, M. Kurimo, Th.

Mandl, D. Mostefa, A. Penas, & G. Roda (Eds.), Multilingual information access evaluation Vol. I: Text retrieval experiments, workshop of the cross-language evaluation forum, CLEF 2009, Corfu,

Greece, September 30–October 2, 2010. Lecture notes in computer science 6241. Springer (Revised

selected papers).

Ion, R., Stefanescu, D., Ceausu, A., Tufis, D., Irimia, E., & Barbu-Mititelu, V. (2009). A trainable multi-

factored QA system. In C. Peters, G. di Nunzio, M. Kurimo, Th. Mandl, D. Mostefa, A. Penas, & G.

Roda (Eds.), Multilingual information access evaluation Vol. I: Text retrieval experiments, workshop

of the cross-language evaluation forum, CLEF 2009, Corfu, Greece, September 30–October 2, 2010.

Lecture notes in computer science 6241. Springer (Revised selected papers).

Jijkoun, V., & de Rijke, M. (2007). Overview of the WiQA task at CLEF 2006. In C. Peters, P. Clough, F.

C. Gey, J. Karlgren, B. Magnini, D. W. Oard, M. de Rijke, & M. Stempfhuber (Eds.), Evaluation of multilingual and multi-modal information retrieval, 7th workshop of the cross-language evaluation

forum, CLEF 2006, Alicante, Spain, September 20–22, 2006 (Revised selected papers).

Jijkoun, V., Mishne, G., & de Rijke, M. (2003). The University of Amsterdam at QA@CLEF2003. In C.

Peters, J. Gonzalo, M. Braschler, & M. Kluck (Eds.), Comparative evaluation of multilingual information access systems, 4th workshop of the cross-language evaluation forum, CLEF 2003,

Trondheim, Norway, August 21–22, 2003 (Revised selected papers).

Jijkoun, V., Mishne, G., de Rijke, M., Schlobach, S., Ahn, D., & Muller, H. (2004). The University of

Amsterdam at QA@CLEF 2004. In C. Peters, P. Clough, J. Gonzalo, G. J. F. Jones, M. Kluck, & B.

Magnini (Eds.), Multilingual information access for text, speech and images, 5th workshop of the

cross-language evaluation forum, CLEF 2004, Bath, September 15–17, 2004 (Revised selected

papers).

Kelly, D., & Lin, J. (2007). Overview of the TREC 2006 ciQA task. SIGIR Forum, 41, 1.

Lamel, L., Rosset, S., Ayache, C., Mostefa, D., Turmo, J., Comas P. (2007). Question answering on

speech transcriptions: The QAST evaluation in CLEF. In C. Peters, V. Jijkoun, T. Mandl, H. Muller,

D. W. Oard, A. Penas, & D. Santos (Eds.), Advances in multilingual and multimodal information retrieval, 8th workshop of the cross-language evaluation forum, CLEF 2007, Budapest, Hungary,

September 19–21, 2007 (Revised selected papers).

Laurent, D., Seguela, P., & Negre, S. (2005). Cross lingual question answering using QRISTAL for CLEF

2005. In C. Peters, F. C. Gey, J. Gonzalo, H. Muller, G. J. F. Jones, M. Kluck, B. Magnini, & M. de

Rijke (Eds.), Accessing multilingual information repositories, 6th workshop of the cross-language evaluation forum, CLEF 2005, Vienna, Austria, September 21–23, 2005 (Revised selected papers).

Laurent, D., Seguela, P., & Negre, S. (2006). Cross lingual question answering using QRISTAL for

CLEF 2006. In C. Peters, P. Clough, F. C. Gey, J. Karlgren, B. Magnini, D. W. Oard, M. de Rijke, &

M. Stempfhuber (Eds.), Evaluation of multilingual and multi-modal information retrieval, 7th

workshop of the cross-language evaluation forum, CLEF 2006, Alicante, Spain, September 20–22,

2006 (Revised selected papers).

Laurent, D., Seguela, P., & Negre, S. (2007). Cross lingual question answering using QRISTAL for CLEF

2007. In: C. Peters, V. Jijkoun, T. Mandl, H. Muller, D.W. Oard, A. Penas, & D. Santos (Eds.),

Advances in multilingual and multimodal information retrieval, 8th workshop of the cross-language


evaluation forum, CLEF 2007, Budapest, Hungary, September 19–21, 2007 (Revised selected

papers).

Laurent, D., Seguela, P., & Negre, S. (2008). Cross lingual question answering using QRISTAL for CLEF

2008. In C. Peters, T. Mandl, V. Petras, A. Penas, H. Muller, D. Oard, V. Jijkoun, & D. Santos

(Eds.), Evaluating systems for multilingual and multimodal information access, 9th workshop of the

cross-language evaluation forum, CLEF 2008, Aarhus, Denmark, September 17–19, 2008 (Revised

selected papers).

Lopez, V., Uren, V. S., Sabou, M., & Motta, E. (2009). Cross ontology query answering on the semantic

web: An initial evaluation. In K-CAP-2009: Proceedings of the fifth international conference on knowledge capture, Redondo Beach, CA.

Magnini, B., Negri, M., Prevete, R., & Tanev, H. (2002). Is it the right answer? Exploiting web

redundancy for answer validation. In Proceedings of the 40th annual meeting of the association for computational linguistics (ACL), Philadelphia. doi:10.3115/1073083.1073154.

Magnini, B., Romagnoli, S., Vallin, A., Herrera, J., Penas, A., Peinado, V., et al. (2003). The multiple

language question answering track at CLEF 2003. In C. Peters, J. Gonzalo, M. Braschler, & M.

Kluck (Eds.), Comparative evaluation of multilingual information access systems, 4th workshop of

the cross-language evaluation forum, CLEF 2003, Trondheim, Norway, August 21–22, 2003

(Revised selected papers).

Magnini, B., Vallin, A., Ayache, C., Erbach, G., Penas, A., de Rijke, M., Rocha, P., Simov, K., &

Sutcliffe, R. (2004). Overview of the CLEF 2004 multilingual question answering track. In C.

Peters, P. Clough, J. Gonzalo, G. J. F. Jones, M. Kluck, & B. Magnini (Eds.), Multilingual information access for text, speech and images, 5th workshop of the cross-language evaluation

forum, CLEF 2004, Bath, UK, September 15–17, 2004 (Revised selected papers).

Magnini, B., Giampiccolo, D., Forner, P., Ayache, C., Jijkoun, V., Osenova, P., et al. (2006). Overview of

the CLEF 2006 multilingual question answering track. In C. Peters, P. Clough, F. C. Gey, J.

Karlgren, B. Magnini, D. W. Oard, M. de Rijke, & M. Stempfhuber (Eds.), Evaluation of multilingual and multi-modal information retrieval, 7th workshop of the cross-language evaluation

forum, CLEF 2006, Alicante, September 20–22, 2006 (Revised selected papers).

Magnini, B., Speranza, M., & Kumar, V. (2009). Towards interactive question answering: an ontology-

based approach. In Proceedings of the IEEE international conference on semantic computing,

September 22–24, 2010, Carnegie Mellon University, Pittsburgh, PA.

Negri, M., Tanev, H., & Magnini, B. (2003). Bridging languages for question answering: DIOGENE at CLEF-2003. In C. Peters, J. Gonzalo, M. Braschler, & M. Kluck (Eds.), Comparative evaluation of multilingual information access systems, 4th workshop of the cross-language evaluation forum, CLEF 2003, Trondheim, August 21–22, 2003 (Revised selected papers).

Noguera, E., Llopis, F., Ferrandez, A., & Escapa, A. (2007). Evaluation of open-domain question answering systems within a time constraint. In Advanced information networking and applications workshops, AINAW '07, 21st international conference.

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the association for computational linguistics (ACL), Philadelphia, July 2002.

Penas, A., & Rodrigo, A. (2011). A simple measure to assess non-response. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (ACL-HLT 2011), Portland, Oregon, June 19–24, 2011.

Penas, A., Rodrigo, A., Sama, V., & Verdejo, F. (2006). Overview of the answer validation exercise 2006. In C. Peters, P. Clough, F. C. Gey, J. Karlgren, B. Magnini, D. W. Oard, M. de Rijke, & M. Stempfhuber (Eds.), Evaluation of multilingual and multi-modal information retrieval, 7th workshop of the cross-language evaluation forum, CLEF 2006, Alicante, Spain, September 20–22, 2006 (Revised selected papers).

Penas, A., Rodrigo, A., & Verdejo, F. (2007). Overview of the answer validation exercise 2007. In C. Peters, V. Jijkoun, T. Mandl, H. Muller, D. W. Oard, A. Penas, V. Petras, & D. Santos (Eds.), Advances in multilingual and multimodal information retrieval, LNCS 5152, September 2008.

Penas, A., Forner, P., Sutcliffe, R., Rodrigo, A., Forascu, C., Alegria, I., et al. (2009). Overview of ResPubliQA 2009: Question answering evaluation over European legislation. In C. Peters, G. di Nunzio, M. Kurimo, Th. Mandl, D. Mostefa, A. Penas, & G. Roda (Eds.), Multilingual information access evaluation vol. I: Text retrieval experiments, workshop of the cross-language evaluation forum, CLEF 2009, Corfu, Greece, September 30–October 2. Lecture notes in computer science 6241. Springer-Verlag, 2010 (Revised selected papers).

Penas, A., Forner, P., Rodrigo, A., Sutcliffe, R., Forascu, C., & Mota, C. (2010). Overview of ResPubliQA 2010: Question answering evaluation over European legislation. In M. Braschler, D. Harman, & E. Pianta (Eds.), Notebook papers of CLEF 2010 LABs and workshops, September 22–23, 2010, Padua, Italy.

Perez-Iglesias, J., Garrido, G., Rodrigo, A., Araujo, L., & Penas, A. (2009). Information retrieval baselines for the ResPubliQA task. In Borri, F., Nardi, A., & Peters, C. (Eds.), Cross language evaluation forum: Working notes of CLEF 2009, Corfu, Greece, September 30–October 2.

Prager, J. (2006). Open-domain question-answering. Foundations and trends in information retrieval, 1(2), 91–231. http://dx.doi.org/10.1561/1500000001.

Prager, J., Brown, E., & Coden, A. (2000). Question-answering by predictive annotation. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, Athens.

Rodrigo, A., Penas, A., & Verdejo, F. (2008). Overview of the answer validation exercise 2008. In C. Peters, T. Mandl, V. Petras, A. Penas, H. Muller, D. Oard, V. Jijkoun, & D. Santos (Eds.), Evaluating systems for multilingual and multimodal information access, 9th workshop of the cross-language evaluation forum, CLEF 2008, Aarhus, September 17–19, 2008 (Revised selected papers).

Rodrigo, A., Perez, J., Penas, A., Garrido, G., & Araujo, L. (2009). Approaching question answering by means of paragraph validation. In C. Peters, G. di Nunzio, M. Kurimo, Th. Mandl, D. Mostefa, A. Penas, & G. Roda (Eds.), Multilingual information access evaluation vol. I: Text retrieval experiments, workshop of the cross-language evaluation forum, CLEF 2009, Corfu, Greece, September 30–October 2. Lecture notes in computer science 6241. Springer-Verlag, 2010 (Revised selected papers).

Saias, J., & Quaresma, P. (2007). The senso question answering approach to Portuguese QA@CLEF-2007. In C. Peters, V. Jijkoun, T. Mandl, H. Muller, D. W. Oard, A. Penas, & D. Santos (Eds.), Advances in multilingual and multimodal information retrieval, 8th workshop of the cross-language evaluation forum, CLEF 2007, Budapest, September 19–21, 2007 (Revised selected papers).

Saias, J., & Quaresma, P. (2008). The senso question answering system at QA@CLEF 2008. In C. Peters, T. Mandl, V. Petras, A. Penas, H. Muller, D. Oard, V. Jijkoun, & D. Santos (Eds.), Evaluating systems for multilingual and multimodal information access, 9th workshop of the cross-language evaluation forum, CLEF 2008, Aarhus, September 17–19, 2008 (Revised selected papers).

Santos, D., & Cabral, L. M. (2009). GikiCLEF: Crosscultural issues in an international setting: Asking non-English-centered questions to Wikipedia. In Borri, F., Nardi, A., & Peters, C. (Eds.), Cross language evaluation forum: Working notes of CLEF 2009, Corfu, Greece, September 30–October 2.

Toba, H., Sari, S., Adriani, M., & Manurung, M. (2010). Contextual approach for paragraph selection in question answering task. In M. Braschler, D. Harman, & E. Pianta (Eds.), Notebook papers of CLEF 2010 LABs and workshops, September 22–23, 2010, Padua, Italy.

Vallin, A., Magnini, B., Giampiccolo, D., Aunimo, L., Ayache, C., & Osenova, P. (2005). Overview of the CLEF 2005 multilingual question answering track. In C. Peters, F. C. Gey, J. Gonzalo, H. Muller, G. J. F. Jones, M. Kluck, B. Magnini, & M. de Rijke (Eds.), Accessing multilingual information repositories, 6th workshop of the cross-language evaluation forum, CLEF 2005, Vienna, Austria, September 21–23, 2005 (Revised selected papers).

van Rijsbergen, K. J. (1979). Information retrieval. London: Butterworth.

Vicedo, J. L., Izquierdo, R., Llopis, F., & Munoz, R. (2003). Question answering in Spanish. In C. Peters, J. Gonzalo, M. Braschler, & M. Kluck (Eds.), Comparative evaluation of multilingual information access systems, 4th workshop of the cross-language evaluation forum, CLEF 2003, Trondheim, Norway, August 21–22, 2003 (Revised selected papers).

Vicedo, J. L., Saiz, M., & Izquierdo, R. (2004). Does English help question answering in Spanish? In C. Peters, P. Clough, J. Gonzalo, G. J. F. Jones, M. Kluck, & B. Magnini (Eds.), Multilingual information access for text, speech and images, 5th workshop of the cross-language evaluation forum, CLEF 2004, Bath, UK, September 15–17, 2004 (Revised selected papers).

Voorhees, E. M. (2000). Overview of the TREC-9 question answering track. In Proceedings of the ninth text retrieval conference (TREC-9).

Voorhees, E. M. (2002). Overview of the TREC 2002 question answering track. In Proceedings of the eleventh text retrieval conference (TREC-11).

Voorhees, E. M., & Tice, D. M. (1999). The TREC-8 question answering track evaluation. In Proceedings of the eighth text retrieval conference (TREC-8).

Webb, N., & Webber, B. (Eds.) (2009). Special issue on interactive question answering. Journal of Natural Language Engineering, 15(1), Cambridge University Press.
