Qall-Me – Pattern Acquisition for Question Answering

Author: Qall-Me Consortium
Affiliation: FBK, University of Wolverhampton, University of Alicante, DFKI, Comdata, Ubiest, Waycom
Keywords: question answering, knowledge bottleneck, pattern acquisition, textual entailment

Abstract: This document reports on the solutions proposed in the Qall-Me project for the automatic acquisition of patterns to be used in the Textual Entailment approach for Question Answering adopted by the project. More specifically, we report on three approaches: (i) acquisition of patterns from a corpus of questions; (ii) acquisition of patterns from the Qall-Me domain ontology; (iii) acquisition of patterns from the Web.

FP6 IST-033860
http://qallme.itc.it


Source: qallme.fbk.eu/Qallme-PatterAcquisition.pdf



D4.3

1. Introduction
2. Acquisition of relational patterns from questions
   2.1. A fast prototyping experiment
   2.2. Automatic acquisition of relational patterns from questions
      2.2.1. Analysis of hand-crafted patterns
      2.2.2. Supervised acquisition of MRPs
      2.2.3. Experiments and results
3. Acquisition of patterns from the Qall-Me Ontology
   3.1. Question pattern generation
   3.2. Question answering
   3.3. Evaluation and results
   3.4. Results and discussion
4. Acquisition of patterns from the Web
   4.1. Designing a QALL-ME style system
   4.2. System development bottlenecks
   4.3. Automatic domain knowledge acquisition
   4.4. A domain adaptable Textual Entailment engine
   4.5. Comparative evaluation
5. References


1. Introduction

As stated in the QALL-ME Annex I (Description of Work, p. 34), the objectives of WP5 on “Multilingual Answer Extraction components” are: i) “to extract the answer to the question”, ii) “evaluate its reliability”, and iii) “return it to the QA planner in the appropriate format”, also considering the context of the question (i.e. time and location). To achieve these objectives, WP5 is organized in three development cycles, allowing the involved partners to incrementally define and improve their answer extraction components. In particular, we report on the reduction of knowledge acquisition bottlenecks to enhance the portability and scalability of the proposed approach.

Within WP5, the main achievement of the first development cycle of the Project was the definition of a Textual Entailment-based (TE-based, in the rest of this document) approach as a common framework to develop the QALL-ME infrastructure for QA over structured data. According to the TE-based framework, the QA problem is recast as an entailment recognition problem, where the text (T) is the question, and the hypothesis (H) is represented by textual material (either a full question or a single relation pattern) associated with instructions (SPARQL queries over a database) for retrieving the answer to the input question. The backbone of the entailment-based QA architecture developed during the first year of the Project is depicted in Figure 1.

The general implementation of this architecture was driven by the project’s aim to “develop a shared infrastructure for multilingual QA based on a Web Services architecture”. To fit this requirement, each Answer Extraction component was designed to be pluggable into a distributed architecture. Given an input question q, the QALL-ME central QA planner interacts with four separate language-specific Answer Extractors on the basis of shared procedures and I/O formats. More specifically, once the central planner has processed q in its source language, the search for the answer is routed to the appropriate entailment-based Answer Extractor. While the modalities of interaction with the planner are shared, each Answer Extractor implements its own entailment-recognition strategy.

During the second development cycle, according to the research plans reported in the QALL-ME Technical Annex, the Consortium addressed three main directions, namely: i) the improvement of specialized RTE components for each language (also in view of the planned demonstration initiatives), ii) the development of a general-purpose RTE package (to provide all the partners with a shared RTE component for comparative evaluation), and iii) the implementation of advanced reasoning components (to improve answer extraction and enhance effective answer presentation strategies).
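As a rough illustration of this recasting of QA as entailment recognition, the loop below sketches how an Answer Extractor could work. This is a hypothetical sketch, not the project’s implementation: the `entails` and `run_sparql` callables, and the shape of the pattern repository, are assumptions.

```python
def answer(question, pattern_repository, entails, run_sparql):
    """Entailment-based answer extraction (illustrative sketch).

    `pattern_repository` maps a hypothesis H (a full question or a single
    relation pattern) to the SPARQL query that retrieves the answer when
    H is entailed by the input question (the text T).
    """
    results = []
    for pattern, sparql_query in pattern_repository.items():
        # The entailment engine decides whether T (the question) entails H
        if entails(text=question, hypothesis=pattern):
            results.extend(run_sparql(sparql_query))
    return results
```

In the QALL-ME architecture, logic of this kind would live inside each language-specific Answer Extractor, with the central planner routing questions to it.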


Figure 1: The Entailment-based Answer Extraction Process.

Building on the progress made in the first two years of the project (which demonstrated the feasibility of the TE-based approach proposed by QALL-ME), the third development cycle aimed at laying the basis for the actual exploitation of the Project’s foreground. Along this direction, the tasks addressed by the academic partners involved in WP5, and the corresponding achievements (which are more extensively described in the following sections), can be summarized as follows.

Reduction of knowledge acquisition bottlenecks. The objective of this activity was the implementation of minimally supervised techniques for automatic pattern acquisition¹. This work is motivated by portability and scalability reasons. The possibility of instantiating pattern repositories with automatically acquired patterns, in fact, represents an enabling factor for the fast prototyping of QALL-ME applications in different languages, domains, or larger-scale scenarios. Experiments with different automatic pattern extraction techniques led to positive results (e.g. for Italian, around 89% of real users’ test questions in the CINEMA domain are properly handled by the system), even outperforming those achieved with human-generated pattern repositories. This task has been mainly addressed by: FBK (with experiments on the portability of the QALL-ME approach across domains, and with the implementation of a method for automatic pattern acquisition, reported in Section 2), WLV (through experiments on the automatic acquisition of domain-specific questions, as described in Section 3), and UA (through experiments on automatic pattern acquisition, as described in Section 4).

¹ Automatic acquisition techniques have actually been implemented not only to acquire single relation patterns (as described in Section 4.2.), but also full questions (as reported in Section 5.1.).


2. Acquisition of relational patterns from questions

This section reports on the work carried out at FBK during the third year of the project on the acquisition of relational patterns.

2.1. A fast prototyping experiment

The automatic acquisition of Relational Patterns represents one of the crucial aspects towards the consolidation of our approach, and its possible exploitation in real-world (possibly large-scale) scenarios after the end of the project. While a number of experiments demonstrated the good potential of the approach with repositories of handcrafted (high precision, high coverage) patterns, in the first two years of the project little has been done to estimate its inherent limitations. Among these, containing the costs of knowledge acquisition still represented an open issue.

Along this direction, a preliminary study has been carried out with the purpose of roughly estimating the overall cost of porting a fully manual approach to a new domain. To this aim, an Italian native speaker was hired for a limited period (~40 hours), during which he was asked to acquire the knowledge required for porting the QALL-ME approach to the ACCOMMODATION domain (ACCO), replicating the work previously carried out on the CINEMA domain (i.e. i) collecting questions for the new domain, ii) annotating them with named entities and relations, and iii) creating a repository of patterns for each relevant domain relation). The ACCOMMODATION domain was selected, among those addressed by the project, due to its similarity, in terms of complexity, to the CINEMA domain (CIN). The expected benefit of this choice is the possibility of more meaningful comparisons between results. To gain insight into the differences between the two domains, we considered simple complexity indicators such as:

- Number of domain relations (CIN: 13; ACCO: 11)
- Vocabulary size (CIN: 150; ACCO: 120)
- Average number of relations per question (CIN: 2.05; ACCO: 1.44)
- Average question length (CIN: 7.74; ACCO: 9.45)

According to most of these indicators, though substantially similar, the ACCO domain seems to be slightly easier to model than the CIN domain. If this hypothesis is true, we expect slightly better results with the new domain with pattern repositories of comparable size. The outcome of this short-time knowledge acquisition experiment was a collection of:

- 232 domain-specific questions annotated with named entities and relations, 18 on average per relation (required time: ~32 hours, avg. 7.25 questions per hour);
- 144 patterns, 13 on average per relation (required time: ~8 hours, avg. 18 patterns per hour).


The system’s performance with the acquired patterns has been assessed over a set of 50 test questions (which were not used for pattern extraction). In terms of exact matches², which represent a good indicator of the system’s capabilities in a QA application, 92% of the questions were appropriately handled by the system. The comparison with the 88.15% exact matches achieved on the CIN domain (with the same configuration of the entailment engine) allows us to draw the following considerations:

- Simple complexity indicators (the number of ontology relations in a given domain, and basic features of the training questions) can provide reliable insights about the complexity of a new domain. As expected, on the easier ACCO domain we obtain a performance improvement, even with fewer patterns (144 for ACCO vs. 149 for CIN), and without any tuning of the system;

- Domains of similar complexity can be effectively modeled with similar amounts of patterns. Also for the ACCO domain, in fact, fewer than 20 patterns per relation allow for high-precision QA (for the CIN domain we manually created around 11 patterns per relation);

- Given a set of domain-specific questions already annotated with named entities, an approximate estimation of the time required for manual pattern creation reveals an average speed of 15-20 patterns per hour.

As a general conclusion, we can expect that, at least for a domain of similar complexity (in terms of relations, vocabulary size, etc.), the amount of work required for manual porting can be roughly estimated at a couple of days (75% of this time is required for question collection/annotation, and 25% for pattern creation). Even though this is a relatively limited effort, the knowledge acquisition process might still represent a bottleneck for concrete exploitation purposes. The adoption of the manual approach, in fact, is potentially detrimental (i.e. too expensive) when scalability to more complex domains and portability across languages are a matter of concern. As a consequence, the real exploitation potential of the entailment-based approach also remains limited.

From an exploitation perspective, even though it requires around 2/3 of the time, the collection of domain-specific questions does not represent the most critical aspect of the prototyping process. During the first two years of the project, in fact, most of this time was required for actually “inventing” questions rather than annotating them. However, on the one hand this situation probably does not reflect the real situation of potential customers interested in a QALL-ME-based service to access their data. On the basis of our experience, in fact, potential customers usually already have samples (often huge collections) of questions they need to manage automatically. On the other hand, when training questions are not available, the QALL-ME experience with focus groups involving real users asking questions over the telephone proved an effective, fast, and relatively cheap way to collect data.

² The exact match score (see D5.2) represents the amount of test questions for which all and only the relations annotated in the gold standard are recognized by the system. For these questions, the resulting database query will then contain all and only the appropriate restrictions expressed by the input question. The exact match score gives a clearer idea of the number of questions that would actually be answered by the system, independently of the availability of data. This represents a better assessment of the system’s performance, since such information cannot be precisely induced from Precision/Recall/F-measure scores.


In contrast, the pattern acquisition process has considerable room for improvement, and also represents an interesting research direction. First of all, a potential reduction of 25% in the prototyping time should not be underestimated. Secondly, the possibility of achieving good performance (comparable to, or even exceeding, the results achieved by human-generated patterns) in an automatic way represents a further motivation and a challenging research issue. Working in this direction, we aim, in the ideal situation, at being able to generate in a short time a reliable pattern repository, simply starting from a set of domain-specific questions annotated with relations. To this aim, effective automatic pattern acquisition techniques are definitely an advantage.

2.2. Automatic acquisition of relational patterns from questions

This section reports on our work on automatic pattern acquisition, and the implementation of a supervised approach to learn relational patterns from questions annotated with domain entities and relations. Building on the definition of Minimal Relational Pattern³ (MRP) already proposed in deliverables D5.1 and D5.2, the implementation of our pattern acquisition component has been carried out in two phases, namely: i) the analysis of the main features of the MRPs manually acquired during the first two development cycles, and ii) the actual implementation, building on this analysis, of a supervised approach to acquire patterns with similar features.

2.2.1. Analysis of hand-crafted patterns

Looking at the handcrafted patterns collected for previous experiments, we can observe that:

- All the patterns are subsequences of the training questions, where word order is preserved (e.g. Q: “What’s on at Cinema Modena tonight?” – P1: “What’s on at [CINEMA]”, P2: “on at [CINEMA]”);

- While word order is preserved, pattern terms are often not consecutive in the original questions (e.g. Q: “What’s on tonight at Cinema Modena?” – P1: “What’s on at [CINEMA]”, P2: “on at [CINEMA]”). As can be seen from the example, allowing word gaps prevents the creation of multiple-relation patterns (i.e. non-minimal patterns such as “on [DATE] at [CINEMA]”);

³ We say that a relational pattern p expresses a relation R(arg1, arg2) in a certain language L if speakers of L agree that the meaning of p expresses the relation R between arg1 and arg2, given their knowledge about the entities. Minimal Relational Patterns are relational patterns that have the additional property of representing only one relation. MRPs can be formally defined in terms of TE as follows: given two sets of relational patterns P1 and P2 for the relations R1 and R2, a pattern pk belonging to P1 is an MRP for R1 if condition (1) holds:

∀pi ∈ P2: pk ↛ pi     (1)

In other words, a pattern p is minimal for a relation R if none of the patterns for the other relations can be derived from p (i.e. is logically entailed by p).


- Original question terms (especially prepositions) are often lemmatized in the generated patterns (e.g. Q: “Cosa danno al Cinema Modena?” [“What’s showing at Cinema Modena?”] – P: “Cosa dare a [CINEMA]”);

- Some parts of speech (e.g. determiners and adjectives) rarely appear in the patterns, and can be considered as stop-words in the acquisition process;

- The average length of the manually created patterns is 4.2 words (min 2, max 8);

- At first glance, patterns could be even shorter: single words or entities are often clues of the actual presence of a relation R in an input question (e.g. “star”, “[STAR]”, “telephone”, “address”, “director”, “[GENRE]”). In some cases, single-word “patterns” are the only possible solution (for instance, given “Chi ha diretto la commedia di stasera al Vittoria?” [“Who directed tonight’s comedy at the Vittoria?”], the only clue for the HasGenre(Movie,Genre) relation is the word commedia or, better, the entity [GENRE]). In light of this, effective “patterns” do not necessarily have to be long word sequences or complex phrases;

- Most of the patterns are full phrases (NPs, PPs), but non-grammatical expressions also appear (e.g. P: “[DATE] [MOVIE]”; P: “[DATE] film”).

These observations provided us with useful insights for the design of our automated pattern acquisition process.

2.2.2. Supervised acquisition of MRPs

Building on the previous considerations, an n-gram based approach to automated pattern acquisition has been adopted. Our supervised approach starts from a collection of questions annotated with named entities and relations, which is similar to the training data used for manual pattern acquisition. The process is carried out in four steps: i) pattern collection, ii) correlation check, iii) pattern weighting, and iv) pattern filtering.

STEP 1: pattern collection. The first step aims at collecting a large quantity of candidate relational patterns in the form of raw textual material (n-grams). For this purpose, a large collection of n-grams (with n=2, 3, …, k, where k can be a fixed value, equal either to the length of the longest training question or to the average length of the training questions) is extracted from the annotated training questions. Two types of n-grams are collected: i) sequences of consecutive words (e.g. “w1 w2 w3 w4”), and ii) word sequences with gaps (e.g. “w1 _ _ w4”, “w1 _ w3 _ w5”). As an example, the n-grams extracted from the question “what movie can I see at [CINEMA] [DATE]?” include:

2-grams: what movie, movie can, can I, I see, …
3-grams: what movie can, movie can I, … what _ can, movie _ I, …
4-grams: what movie can I, movie can I see, … what _ can I, what movie _ I, movie can _ see, can I _ at, … what _ _ I, movie _ _ see, can _ _ at, …
5-grams: what movie can I see, movie can I see at, can I see at [CINEMA], … what _ can I see, what movie _ I see, what movie can _ see, … what movie _ _ see, … what _ _ _ see, …
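The collection step above can be sketched in a few lines. This is an illustrative reimplementation under one assumption drawn from the examples: an n-gram spans a window of n consecutive tokens whose first and last tokens are always kept, while any subset of interior tokens may be replaced by a gap (“_”).

```python
from itertools import combinations

def gapped_ngrams(tokens, n):
    """Collect all n-grams (with and without gaps) from a token list.

    Each n-gram covers a window of n consecutive tokens; interior tokens
    may be replaced by "_", as in "what _ can" or "movie _ _ see".
    """
    grams = set()
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        interior = range(1, n - 1)          # positions that may be gapped
        for size in range(len(interior) + 1):
            for gaps in combinations(interior, size):
                grams.add(" ".join("_" if i in gaps else tok
                                   for i, tok in enumerate(window)))
    return grams
```

For the example question, calling `gapped_ngrams` with n=3 over the tokens of “what movie can I see at [CINEMA] [DATE]?” yields, among others, “what movie can” and “what _ can”.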


The large number of n-grams collected in this way⁴ contains the following word sequences, which actually represent reasonable patterns for two relations in the CINEMA domain:

movie _ _ _ at [CINEMA] → IsInSite(Movie,Cinema)
what movie can I see at [CINEMA] → IsInSite(Movie,Cinema)
what _ can I see at [CINEMA] → IsInSite(Movie,Cinema)
movie _ _ _ _ _ [DATE] → HasDate(Movie,Date)
movie can I see [DATE] → HasDate(Movie,Date)

Even though some parameters of the n-gram collection process could be subject to empirical estimation through iterative evaluations on the training data, this aspect has not been investigated yet. The maximum size of the n-grams to be collected (i.e. the maximum value of k) and the size of the window from which n-grams are extracted (the value of x) have both been set to the fixed value of 12. This would potentially lead to a maximum of 1,743,441 patterns (4,083 × 427); however, since the average question length is ~7.7 words, the actual number of collected n-grams is considerably lower.

STEP 2: correlation check. The second step of the process aims at finding correlations between the collected n-grams and the relevant domain relations. The goal is to filter the n-grams in order to retain only those that feature a high correlation with one or more relations (i.e. those whose occurrence in the input questions often activates these relations). The idea is that the higher the correlation, the higher the quality of a candidate pattern. More specifically, patterns featuring a maximal correlation with a single relation R (i.e. always activating the relation R) will be considered as MRPs for R, while patterns featuring a maximal correlation with multiple relations R1, …, Rn will be considered as relational patterns (RPs) for R1, …, Rn. As a first approximation, the filtering process considers, as an indicator of correlation, the ratio (z) between the number of appearances of a pattern p in the questions annotated with a relation R and the total number of training questions annotated with R.
From a “minimality” perspective, we assume that the shorter the list of triggered relations, the better the pattern: if the appearance of p always activates a single relation R (i.e. every time a training question contains p, the question is annotated only with R), then we can assume that p is minimal for R. Non-minimal patterns will activate multiple relations:

⁴ As far as the number of collected n-grams is concerned, an upper bound can be roughly estimated by calculating all the k-combinations C(x,k) (with k=2, 3, …) of x elements, i.e. x!/(k!(x−k)!). For instance, setting x=13 (approximately the average length of the Italian and English questions in the QALL-ME Benchmark, without the annotated Named Entities), and k=2, …, 7 (we want to collect n-grams of size 2, 3, …, 7), for a training set of 999 questions (as in Gretter et al. 2008, and Negri and Kouylekov, 2009) we obtain:

Average number of combinations (n-grams) per question Q: C(13,2) + C(13,3) + … + C(13,7) = 5,798
Total number of n-grams: 5,798 × 999 = 5,792,202

If we allow for n-grams of length up to 13, we obtain 8,169,822 n-grams (8,178 × 999). In the case of the Italian question dataset for the CINEMA domain (427 training questions, average length 7 with annotated Named Entities) we obtain around 51,000 n-grams with k=2, …, 7 (i.e. C(7,2) + C(7,3) + … + C(7,7) = 120; 120 × 427 = 51,240). Even with these numbers the overall approach is computationally feasible; still, it is worth noting that these figures just represent an upper bound, in which repeated n-grams are counted multiple times. However, our dataset (and, in general, any restricted-domain QA scenario) features a large number of repeated word sequences. As a consequence, in the actual pattern acquisition process we will always expect a much smaller number of extracted candidates.
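The upper-bound figures in the footnote above are straightforward to reproduce; the short check below recomputes them with Python’s `math.comb` (the formula and figures come from the footnote; the function name `ngram_upper_bound` is ours).

```python
from math import comb

def ngram_upper_bound(x, k_values):
    """Sum of C(x, k): order-preserving k-token subsequences of an
    x-token question, one term per allowed n-gram size k."""
    return sum(comb(x, k) for k in k_values)

per_question = ngram_upper_bound(13, range(2, 8))   # n-gram sizes 2..7 -> 5798
total_999 = per_question * 999                      # 5,792,202
cinema_total = ngram_upper_bound(7, range(2, 8)) * 427  # 120 * 427 = 51,240
```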


even though their usefulness is limited for our purposes, they do not necessarily have to be rejected⁵. To clarify how the correlation check is carried out, consider the following situation:

- a list of 4 relations for which we want to automatically acquire patterns;
- px, a collected n-gram (candidate pattern), which appears 50 times in the annotated training questions;
- 35 training questions marked with R1, 20 with R2, 10 with R3, and 37 with R4;
- 11, 0, 2, and 37 appearances of px, respectively, in the 4 groups of questions.

The corresponding values of z, also reported in Table 1, are: 0.31 for R1 (11/35), 0 for R2 (0/20), 0.2 for R3 (2/10), and 1 for R4 (37/37).

        #Qs per Rel   Appearances of px     z
R1          35               11            0.31
R2          20                0            0
R3          10                2            0.2
R4          37               37            1
Total                        50

Table 1: assignment of correlation scores.
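The computation of z, and the precision-oriented decision it feeds, can be sketched as below. This is our illustrative restatement of the rule described in the text, not project code.

```python
def correlation_scores(appearances, questions_per_relation):
    """z for each relation R: appearances of the candidate pattern in
    questions marked with R, divided by the number of questions marked with R."""
    return {rel: appearances[rel] / questions_per_relation[rel]
            for rel in questions_per_relation}

def classify(z_scores):
    """Keep only maximal-correlation (z = 1) patterns: an MRP if exactly
    one relation is maximal, an RP if several are, rejected otherwise."""
    maximal = [rel for rel, z in z_scores.items() if z == 1.0]
    if len(maximal) == 1:
        return "MRP", maximal
    if maximal:
        return "RP", maximal
    return "rejected", []
```

On the px example, z comes out as {R1: 0.31, R2: 0, R3: 0.2, R4: 1}, and px is classified as an MRP for R4.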

For each acquired candidate pattern, different situations are possible with respect to the four domain relations in our example. Table 2 illustrates the different possibilities, together with the corresponding decision on whether or not to retain each candidate pattern.

Candidate pattern   R1     R2    R3     R4    Decision
p1                  0      0     1      0     MRP for R3
p2                  1      0.3   0.6    1     RP for R1 and R4
p3                  0.11   0.2   0.37   0.5   -
p4                  0      0.9   0      0     MRP for R2???
…
px                  0.31   0     0.2    1     MRP for R4

Table 2: selection of MRPs.

While retaining patterns assigned score 1 (either for single or multiple relations) should be relatively safe, the decision about patterns with scores close to 1 (such as p4 in Table 2) is more difficult. In these cases a high, but not maximal, correlation score might depend on errors in the training set (a missing relation in the annotation). However, adopting a precision-oriented approach as a first approximation, only patterns assigned score 1 (i.e. those, like p1, p2, and px, that feature a maximal correlation, z=1, with one or more relations) have been retained in our acquisition experiments, leaving the more difficult cases for further experiments.

⁵ The value of non-minimal patterns lies in allowing the capture of implicit relations (e.g. “movies tonight in Trento”, where Rel(Movie,Cinema) is not explicitly activated by a text portion, but is indirectly conveyed by the pattern “in [DESTINATION]” for the Rel(CINEMA,DESTINATION) relation), and of relation co-occurrences.

STEP 3: pattern weighting. The third step assigns reliability scores to the selected patterns. The weighting mechanisms have been designed to reward patterns that are potentially more precise and to penalize less precise ones. Precision-oriented pattern weighting is motivated by the fact that low-precision patterns will likely lead to spuriously recognized relations, due to a higher number of false positives in entailment detection. Spurious relations, adding wrong restrictions to the resulting database query, severely reduce the chances of answering questions correctly. In this respect, a maximal correlation score (z=1) is a good indicator of precision, but it is only a necessary, not a sufficient, condition. Problems may arise, for instance, from out-of-domain (OOD) questions that actually entail patterns of a valid domain relation. Consider the following two patterns for the relation IsInSite(Movie,Cinema):

p1: "What can I see" – apparently a good pattern
p2: "What can I see at [CINEMA]" – a high-precision pattern

On a training set limited to the CINEMA domain, p1 appears to be a good pattern for IsInSite(Movie,Cinema). However, retaining p1 in our pattern repository will likely lead to errors on OOD test questions such as "What can I see at [THEATRE]?" or "What can I see in [MUSEUM]?" (which, in the CINEMA domain, should actually return a null answer). To address these issues, precision-oriented pattern weighting mechanisms have been implemented considering, first of all, statistical evidence: patterns whose uni/bi/tri-grams appear more frequently in the training questions marked with a relation Ri receive higher scores for Ri (calculated, for each relation R, in terms of tf*idf over the training set). It is worth noting that, by combining the tf*idf scores of uni/bi/tri-grams, longer patterns are not penalized with respect to single-word patterns (as would happen considering just the tf*idf of single words). In most cases this leads to longer and more readable phrase-based patterns.
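The tf*idf-based weighting of uni/bi/tri-grams can be sketched as follows. This is an illustrative reconstruction, not the project's implementation: the exact tf*idf variant, the toy training data, and the way n-gram scores are combined into a pattern weight (summation, here) are assumptions.

```python
import math
from collections import Counter, defaultdict

def ngrams(tokens, n_max=3):
    """All uni/bi/tri-grams of a token list."""
    return [tuple(tokens[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_weights(training):
    """training: list of (tokens, relation) pairs.
    Returns {relation: {ngram: tf*idf score}} (one "document" per relation)."""
    tf = defaultdict(Counter)   # per-relation n-gram counts
    df = Counter()              # number of relations in which an n-gram occurs
    for tokens, rel in training:
        for g in set(ngrams(tokens)):
            tf[rel][g] += 1
    for rel in tf:
        for g in tf[rel]:
            df[g] += 1
    n_rels = len(tf)
    return {rel: {g: tf[rel][g] * math.log(n_rels / df[g])
                  for g in tf[rel]} for rel in tf}

def pattern_weight(pattern_tokens, rel, weights):
    """Weight of a pattern for a relation: sum of its n-grams' tf*idf scores,
    so longer patterns are not penalized w.r.t. single-word ones."""
    return sum(weights[rel].get(g, 0.0) for g in ngrams(pattern_tokens))

# Hypothetical annotated training questions (relation names invented).
train = [("what can i see at [CINEMA]".split(), "IsInSite"),
         ("what movies are at [CINEMA]".split(), "IsInSite"),
         ("how much is the ticket".split(), "HasTicketPrice")]
w = tfidf_weights(train)
```

N-grams shared by every relation get an idf of log(1) = 0, so only relation-specific material contributes to a pattern's weight.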

The other aspect considered for pattern weighting is their "cohesion", taken as an indicator of lexical plausibility: patterns containing a lower number of gaps in the training questions receive a higher score (calculated as a real value in [0,1], inversely proportional to the number of word gaps).
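A possible reading of the cohesion score is sketched below. The deliverable only states that the score lies in [0,1] and is inversely proportional to the number of gaps, so the concrete formula 1/(1+gaps), like the example question, is an assumption for illustration:

```python
def cohesion(pattern, question):
    """Cohesion of a pattern w.r.t. a training question: a real in [0, 1],
    inversely proportional to the number of word gaps between consecutive
    pattern words as they occur in the question. The formula 1/(1+gaps)
    is an assumed instantiation of that description."""
    positions, start = [], 0
    for w in pattern:
        positions.append(question.index(w, start))
        start = positions[-1] + 1
    gaps = sum(b - a - 1 for a, b in zip(positions, positions[1:]))
    return 1.0 / (1.0 + gaps)

q = "what can i possibly see at the [CINEMA]".split()
cohesion("what can i".split(), q)        # contiguous words -> 1.0
cohesion("what see [CINEMA]".split(), q) # 5 intervening words -> 1/6
```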

STEP 4: pattern filtering. The weighted patterns are finally filtered in order to retain only the best patterns to be stored in the repository. To this aim, two types of filtering mechanisms are applied: i) heuristic-based and ii) performance-based. Heuristic-based filters:

FP6 IST-033860

Weight-based filtering. Goal: retain relation-specific patterns. How: a pattern valid for multiple relations Rx,Ry,Rz is retained only for the relation for which it received the highest weight.

Stop words/repetitions filtering. Goal: retain the more plausible patterns. How: patterns starting or ending with stop-words (determiners, pronouns, and prepositions, e.g. “what’s on [date] at”), and patterns containing word repetitions are filtered out.

Minimality filtering. Goal: retain only MRPs. How: patterns for Rx that contain patterns for other relations are filtered out.

Inclusion filtering. Goal: retain larger valid patterns and increase language variability. How: patterns with a lower weight for a relation Rx, that are included into patterns with a higher weight for the same relation Rx, are filtered out.

Merge filtering. Goal: retain larger valid patterns and increase variability. How: if a pattern px for a relation R represents the union of the words of R’s patterns py and pz with a higher weight, py and pz are removed, while px is retained.
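Two of the heuristic filters above (weight-based and inclusion filtering) can be sketched as follows; the data structures and example patterns are assumptions for illustration, not the project's implementation:

```python
def weight_filter(weights):
    """Weight-based filter: a pattern valid for several relations is
    retained only for the relation where it received the highest weight.
    weights: {(pattern, relation): score} -> filtered mapping."""
    best = {}
    for (pat, rel), score in weights.items():
        if pat not in best or score > weights[(pat, best[pat])]:
            best[pat] = rel
    return {(pat, rel): s for (pat, rel), s in weights.items()
            if best[pat] == rel}

def inclusion_filter(patterns):
    """Inclusion filter: drop a lower-weighted pattern whose words are all
    contained in a higher-weighted pattern for the same relation.
    patterns: list of (pattern_words, weight) for one relation."""
    kept = []
    for words, weight in sorted(patterns, key=lambda p: -p[1]):
        if not any(set(words) <= set(kw) for kw, _ in kept):
            kept.append((words, weight))
    return kept

# Hypothetical patterns and weights.
pat = ("what", "can", "i", "see")
filtered = weight_filter({(pat, "IsInSite"): 0.9, (pat, "HasGenre"): 0.3})
kept = inclusion_filter([(("see",), 0.4), (pat, 0.9)])
```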

Performance-based filter (pattern selection). Performance-based filtering aims at retaining only the least noisy patterns, isolating the subset of the acquired n-grams that performs best on the training set. To this aim, we have experimented with different algorithms for subset selection. The naive exhaustive approach is to generate all possible subsets (i.e. the power set) of the patterns, and to evaluate the performance of each subset on the training data in order to select the best one. However, this approach is not feasible for large numbers of patterns because of its computational cost: the complexity is exponential, O(2^n). For instance, while extracting the best subset from a set of 21 candidate patterns requires around two days6 (the power set of 21 patterns contains 2^21 = 2,097,152 subsets), the same operation on a set of 22 candidates requires more than 4 days.

For this reason, genetic algorithms have been considered as a potentially more suitable approach to the problem. A genetic algorithm (GA) is a search technique, inspired by natural evolution, used to find exact or approximate solutions to optimization and search problems. It evolves a population of abstract representations of candidate solutions (genetic representations, called chromosomes) toward better solutions. Each generated solution is evaluated by means of a (problem-dependent) fitness function that measures its quality. Once the genetic representation and the fitness function are defined, the GA initializes a population of random solutions and then improves it through the repeated application of the genetic operators used to generate and evaluate new populations: mutation, crossover, and selection.

To experiment with the GA approach, we used the Java Genetic Algorithms Package (JGAP, freely available under the GPL license at http://jgap.sourceforge.net/). A subset of patterns is represented as a boolean vector with one position per pattern in the original set: the patterns belonging to the subset are marked true, all the others false. The fitness function of a subset is its accuracy on the training set. Although the genetic algorithm does not guarantee that the best subset of patterns is always selected, it yields a good approximation within reasonable execution time (e.g. in contrast with the two days required by the exhaustive approach, JGAP needs 6'45'' to compute pattern selection over 21 patterns, and 7'30'' with 22 candidates).

6 Using an Intel Core 2 Quad CPU @ 2.66GHz
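The subset-selection scheme described above (boolean chromosomes, accuracy as fitness) can be sketched as a minimal GA. This is purely illustrative and not the JGAP-based implementation used in the project; the toy fitness function stands in for accuracy on the training set:

```python
import random

def genetic_select(n_patterns, fitness, pop_size=20, generations=50,
                   p_mut=0.05, seed=0):
    """Minimal GA for pattern subset selection.
    A chromosome is a boolean vector: True = pattern included in the subset.
    fitness(chromosome) -> score to maximize (e.g. training-set accuracy)."""
    rnd = random.Random(seed)
    pop = [[rnd.random() < 0.5 for _ in range(n_patterns)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]              # selection (elitist)
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rnd.sample(survivors, 2)
            cut = rnd.randrange(1, n_patterns)       # one-point crossover
            child = a[:cut] + b[cut:]
            child = [not g if rnd.random() < p_mut else g  # mutation
                     for g in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Toy fitness: pretend patterns 0 and 2 help and the rest add noise,
# rewarding chromosomes that include exactly the helpful ones.
target = {0, 2}
fit = lambda c: sum(1 for i, g in enumerate(c) if g == (i in target))
best = genetic_select(6, fit)
```

Unlike the exhaustive search, the cost here grows with population size and generations rather than with 2^n, which is what makes the approach practical beyond ~21 candidate patterns.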

2.2.3. Experiments and results

This section briefly reports on some of the experiments carried out to evaluate our acquisition procedure. All the experiments are carried out with patterns automatically extracted from training sets of Italian questions annotated with named entities and relations. The number of exact matches has been used as the main evaluation metric.

Experiment 1. Goal: evaluate our pattern acquisition procedure. How: results achieved with an increasing number of patterns (from 1 to 25) are compared with those obtained, over the same test set, by:

the human – results obtained by using a repository instantiated only with handcrafted patterns7;

the upper bound – results achieved by using only the subset of human patterns that are contained (i.e. whose words appear) in the training questions, thus being actually learnable by a pattern acquisition component;

the baseline – results achieved by collecting, as relational patterns for each Ri, all the n-grams that always activate Ri. This, of course, guarantees a high correlation between patterns and relation activation, but does not guarantee the "minimality" of the acquired patterns (with an expected impact on the system's precision, due to a larger number of spuriously recognized relations).

Dataset: 731 questions. The training set contains 427 questions: 143 extracted from the QALL-ME Benchmark and 283 posed by real users in the first QALL-ME focus group. The test set contains 304 questions, all posed by real users in the second focus group. Domain: CINEMA. Results are reported in Figure 2, which plots the percentage of exact matches (Y axis) obtained with an increasing number of patterns (X axis). The four curves refer to: i) our best acquisition procedure (Automatic), ii) the baseline, iii) the human, and iv) the upper bound.

7 It is worth noting that not all the human patterns have been extracted from the available training data. Especially for demonstration purposes, handcrafted patterns for many domain relations were "invented" from scratch to maximize the variability of the textual material stored in the pattern repository. In principle, the presence of invented patterns should represent the added value of human intervention.


Figure 2: automatic pattern acquisition vs. human, upper bound, and baseline results

Comments. As expected, our automatically acquired patterns outperform the baseline for almost any number of patterns (except when a single pattern per relation is selected). Our best result (88.82% exact matches, obtained with up to 15 patterns per relation) improves on the best baseline result (73.68%, with 16 patterns) by more than 20.5% relative. In terms of questions correctly handled by the system, this difference corresponds to 46 questions (out of 304). More significantly, our best result also outperforms the best score achieved by the upper bound (86.51%, with 11 patterns) and the best score achieved by the human (86.84%, with 13 patterns). In general, it is worth noting that with at least 15 patterns per relation the automatic acquisition procedure always outperforms our terms of comparison, even though the difference is small. With up to 25 patterns per relation, when performance becomes fairly stable (even though not optimal), the results in terms of questions correctly handled are: 263 (automatic), 261 (human), 259 (upper bound) and 204 (baseline). These results demonstrate the effectiveness of our acquisition procedure, especially in capturing regularities neglected by the humans involved in the pattern creation process. Surprisingly, while the expected added value of human intervention is rather limited (the human and upper bound curves, in fact, cross each other more than once), such regularities actually make the difference when the number of patterns increases (i.e. beyond 15 patterns).


Experiment 2. An inherent limitation of the proposed approach is reflected in the non-monotonic growth of the curves shown in Figure 2, explained by the fact that good patterns acquired from the training set do not necessarily guarantee optimal performance on the test set. This affects the possibility of estimating the optimal number of patterns to be learned in order to maximize performance on the test set (e.g. 15 in the previous experiment). However, while the effect of using more patterns is substantially unpredictable for the baseline, the results obtained by the proposed automatic acquisition approach are more coherent. Our hypothesis is that more refined pattern filtering/selection mechanisms should follow a trend closer to the growth obtained by linguistically motivated human patterns, becoming more stable as the number of patterns increases. If this hypothesis holds, larger numbers of automatically acquired patterns should guarantee acceptable (though not necessarily optimal) performance. To check the validity of this hypothesis, a second experiment has been carried out comparing the results obtained by two versions of our acquisition procedure (the best one, and a less precise one) with up to 50 patterns per relation.

Goal: obtain insights on how to estimate the optimal number of patterns to be learned. How: verify the stability of the system's performance when large numbers of patterns are used. To analyze possible differences between more and less refined acquisition procedures, two versions of the procedure are compared: i) the best version, with pattern selection, and ii) a less precise one, without pattern selection. Dataset: the same as used for Experiment 1. Domain: CINEMA.

Results are reported in Figure 3, which plots the percentage of exact matches (Y axis) obtained by the two procedures with an increasing number of patterns (X axis).

Comments. As can be seen, even with up to 50 patterns per relation, the system's performance remains substantially stable with both acquisition procedures. In both cases, from 25 patterns onwards the exact matches vary within a range of around 3 percentage points. As regards the results obtained with pattern selection, it is worth recalling that the slight variations with 40+ patterns are inherently dependent on the adoption of a genetic algorithm, which does not always guarantee the selection of the optimal subset of patterns (as explained in Section 4.2.2). These observations confirm that, even though the optimal number of patterns cannot be estimated on the training data, working with larger numbers of automatically acquired patterns does not drastically impact performance. Concerning the differences, in terms of stability of the results, between more and less refined acquisition procedures, this experiment does not provide any particular evidence. This is probably due to the good overall quality of the acquired patterns even before the selection process, which indirectly confirms the quality of our heuristic-based filters. However, in terms of strict performance results, the difference between the two acquisition procedures is considerable for almost any number of patterns, demonstrating the reliability of the pattern selection algorithm.


Figure 3: System performance with pattern repositories of increasing size

Experiment 3. As pointed out in Section 4.1, once automatic pattern acquisition procedures are available, the collection of training data (annotated questions) remains the last knowledge acquisition bottleneck with a potential impact on the fast prototyping of QALL-ME applications. Even though the acquisition of domain-specific questions is not the focus of our research, a third experiment has been carried out to analyze performance variations with different amounts of training data.

Goal: analyze the role of training data in the automatic acquisition process. How: compare the performance of patterns extracted from training sets of increasing size. To this aim, the training set is randomly split into 10 folds (containing around 40 questions each) which are used incrementally, at each iteration, to acquire up to 25 patterns per relation. Starting from the first fold (42 questions), new folds are added one at a time until the whole training set is used (i.e. in the last iteration, patterns are extracted from the same training set of 427 questions used in Experiment 1). Dataset: the same as used for Experiment 1. Domain: CINEMA.

Results are reported in Figure 4, which plots the percentage of exact matches obtained with each training fold (the size of each fold is reported in parentheses).


Figure 4: automatic pattern acquisition with different amounts of training questions

Comments. Though human performance with 25 patterns (85.86% exact matches) is exceeded only when the entire training set is used, the rapid growth of the curve shows that acceptable performance is actually reached even with half of the available training data. More specifically, with 210 training questions (fold 5) the automatically acquired patterns obtain 82.89% exact matches, corresponding to 252 questions correctly handled by the system. Considering these numbers, it is worth noting that:

Using 25 handcrafted patterns per relation, the questions correctly handled by the system are 261 (just 9 more). This means that, with similar amounts of patterns (max 25 per relation), the distance between the handcrafted patterns and those learned from half of the training set is less than 3% in terms of exact matches.

Using the entire training set for automatic acquisition, the questions correctly handled by the system are 263 (just 11 more). This means that, with similar amounts of patterns, the distance between the patterns learned from half of the training set and those learned from the entire training set is just 4.3%.

Finally, even though half of the data is enough to achieve acceptable performance, the trend of the curve shows a tendency towards further improvements if more data are available. This is particularly interesting if we take into consideration that performance improvements with handcrafted patterns are much harder to obtain. In fact, the human-generated patterns (including the "invented" ones) are the result of work carried out over the entire duration of the project, with the strong impulse of delivering high-performance demonstrations. Considering the many evaluation runs carried out with the resulting material, we have the impression that a performance plateau has almost been reached, which can hardly be exceeded by the human simply with more training data. In contrast with our expectations, at least in our application scenario, the conclusion is that the real added value is not represented by the human capability of capturing, inventing, or generating all the possible variants of a language expression.

3. Acquisition of patterns from the Qall-Me Ontology

In order to address the knowledge acquisition bottleneck, University of Wolverhampton continued the experiments with automatic generation of question patterns in restricted domains, starting from a domain ontology. This method produces question patterns, called hypothesis questions, along with their associated query templates (e.g. SPARQL query templates); it then recognizes the best-matching question pattern for a given user question and takes its corresponding query template for answer retrieval. This section presents the results of this experiment.

3.1. Question pattern generation

Analysis of 500 user questions randomly selected from the QALL-ME benchmark revealed that they can be categorized into three topics: Site (62.6%), Event (14.2%) and Occurrence (23.2%). The Site and Event questions contain only static information about a site or event, whereas the Occurrence questions also involve dynamic information about the occurrence of an event at a site and/or at a date/time. Of the 500 questions, 76.8% are static questions and 23.2% are dynamic ones. The static questions can be further categorized into three types:

T1 – Query the name of a site or event which has one or more non-name attributes as the constraint(s);
T2 – Query a non-name attribute of a site or event whose name is known; and
T3 – Query a non-name attribute of a site or event whose name is unknown, using its other non-name attribute(s) as the constraint(s).

A T3 question can be treated as the combination of a T1 and a T2 question. For example, the T3 question "could you give me a contact number for a modern Italian restaurant in Solihull?" can be decomposed into the following two questions:

T1: could you give me the name of a modern Italian restaurant in Solihull?
T2: could you give me a contact number for <the name of the restaurant answered in T1>?

The Occurrence questions are more complicated because they involve four kinds of content – Event, Site, Period and Price. Each kind of content could be the target queried by users, i.e.:

Q_Event – Query the event occurring at a site and/or at a date/time;
Q_Site – Query the site where an event occurs (at a date/time);
Q_Period – Query the date/time of an event occurring at a site;
Q_Price – Query the price of an event (occurring at a site and/or at a date/time).

The Q_Event and Q_Site questions can be categorized into T1 if querying the name of the site or event, and into T3 if querying a non-name attribute of the site or event. In contrast to the T1 Site/Event questions, which use static attributes as the constraints (labelled T1S), a T1 Occurrence question may contain only dynamic constraints (labelled T1D) or also contain static constraints (labelled T1D+S). This means that a T1D+S Occurrence question can be decomposed into a T1D Occurrence question and a T1S Site/Event question. The Q_Period and Q_Price questions can be categorized into T2 if the name of the event is known and into T3 if it is unknown. With regard to question structure, T1 and T3 questions on any topic may contain one or more constraints; the T2 questions on Site/Event contain only one constraint (i.e. the name of the site or event), but the T2 questions on Occurrence may contain further constraints (e.g. occurring site and date/time) in addition to the name of the event. Among these 500 questions, most (91%) contain one or two constraints, and only a small number (9%) contain more than two.


Topic                   Type    Example Question
Site                    T1S     Please tell me the name of a Chinese restaurant in Walsall.
                        T2      I want to know the address for the Kinnaree Thai Restaurant.
                        T3      Could you give me a contact number for a modern Italian restaurant in Solihull?
Event                   T1S     What is the name of the film that stars Gerard Butler?
                        T2      How long does the film Norbit last for?
                        T3      When is the new Arnold Schwarzenegger film released?
Occurrence
  Q_Event               T1D     What is the name of the performance that is on at the Grand Theatre on 26th May till 30th May 2008?
                        T1D+S   Could you tell me the name of an action movie which will be shown at Birmingham Showcase Cinema next week?
                        T3      How long does the film last at Birmingham Showcase Cinema?
  Q_Site                T1D     Can you tell me the name of the cinema showing Norbit now?
                        T1D+S   Please tell me the name of the cinema that is showing 300 and has disabled access.
                        T3      How many seats are there for the cinema showing 300?
  Q_Period              T2      Please give me the dates that evening worship is being addressed at Saint Jude's Church?
                        T3      Can you tell the time of the film that will be shown at Birmingham Showcase Cinema next week?
  Q_Price               T2      What is the ticket price to see Aladdin at Birmingham AMC Cinema?
                        T3      What is the ticket price to see the movie that will be shown at 8:00PM at Birmingham Showcase Cinema?

Table 3: Examples of different question types

On the basis of the analysis of user questions, patterns were automatically produced from the ontology. We decided to produce two types of question patterns (i.e. hypothesis questions), T1 and T2, because T3 questions can be decomposed into a T1 and a T2 question. For the T1 type, the Site/Event hypothesis questions (i.e. T1S) contain one or two constraints, whereas the Occurrence questions contain at most three constraints, all dynamic ones (i.e. T1D), since T1D+S questions can be decomposed into a T1D and a T1S question. For the T2 type, the Site/Event hypothesis questions contain only one constraint, whereas the Occurrence ones may contain at most three constraints, including the name of the event plus the site and/or date/time constraints. We did not produce hypothesis questions containing many constraints, because:

most user questions (91%) contain fewer than three constraints;
a user question can entail a hypothesis question of the same type containing some of its constraints, and can thus obtain answers that subsume the exact ones; and
it becomes computationally intensive to produce hypothesis questions that contain more than two constraints.


In total, 2703 hypothesis questions were generated from the tourism-domain ontology: 2339 Site questions, 223 Event questions and 141 Occurrence questions. The actual process of producing question patterns is presented in D4.3.

3.2. Question answering

After the generation of hypothesis questions and query templates, the QA task was reduced to the problem of looking for the hypothesis question that is entailed by a new user question, and then taking its corresponding query template to create a complete SPARQL query for answer retrieval. This is achieved in five steps:

Named and Temporal Entity Annotation
Since the hypothesis questions and query templates contain slots, the named entities in the user question are annotated with the ontology properties and classes, and the temporal entities are annotated following TIMEX2; both are used to fill in these slots. For example, the user question U1 is annotated as follows:

<U1>please tell me the name of a <NE type="Hotel.starRating">3</NE> star hotel in <NE type="Destination.name">Birmingham</NE>.</U1>

Topic and Type Classification
To facilitate discovering the entailed hypothesis question, the topic of the user question and its type are identified using classifiers that narrow down the search scope. To develop the topic and type classifiers, we split the 500 sample user questions into a training set of 300 questions, used to construct the classifiers, and a test set of 200 questions, used to evaluate their accuracy. Each question was manually marked with one of the three topics (Site, Event, Occurrence) and then with one of the three types (T1, T2, T3). All the questions in both the training and the test set had been annotated based on the domain ontology, so the named entities in each question were replaced with the corresponding ontology properties and classes (e.g. Hotel.starRating). After that, each annotated question was tokenized, and stop words were removed from the word tokens. For each unique word token, its document frequency (df) was calculated, and only high-frequency words above a specific df threshold were retained as features for building the classifiers. Weka8, a suite of machine learning software, was used to train a topic classifier and a type classifier. After experimenting with three df threshold values (2, 5 and 10) and two classification algorithms (C4.5 and Naïve Bayes), we obtained the best topic classifier (accuracy 95%) with Naïve Bayes and a df threshold of 2, and the best type classifier (accuracy 85.5%) with Naïve Bayes and a df threshold of 10.
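The feature-selection step described above (stop-word removal, df threshold) can be sketched as follows. The stop-word list and example questions are illustrative assumptions, and the actual classifiers were trained with Weka rather than with this code:

```python
from collections import Counter

# Illustrative stop-word list (an assumption, not the project's list).
STOP = {"the", "a", "of", "for", "me", "please", "to", "is", "what", "i"}

def build_features(questions, df_threshold=2):
    """Select high-document-frequency tokens as classifier features:
    lowercase, split, drop stop words, count in how many questions each
    token occurs (df), and keep tokens with df >= df_threshold."""
    df = Counter()
    for q in questions:
        df.update({t for t in q.lower().split() if t not in STOP})
    return {t for t, n in df.items() if n >= df_threshold}

qs = ["Please tell me the name of a Chinese restaurant in Walsall",
      "Could you give me a contact number for a restaurant in Solihull",
      "I want to know the address for the restaurant"]
feats = build_features(qs, df_threshold=2)   # {"restaurant", "in"}
```

The retained feature set is then used to build bag-of-words vectors for a classifier such as Naïve Bayes or C4.5.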

Textual Entailment Recognition
The textual entailment recognition engines already presented in D5.2 were employed in the process of answering questions.

Answer Retrieval

8 http://www.cs.waikato.ac.nz/ml/weka/


After the entailed hypothesis question is found, its query template is taken and used to create a real SPARQL query by filling in the slots with the named entities and temporal entities extracted from the user question. Then the SPARQL query is used to retrieve the answers to the user question from the RDF database through Jena.
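The slot-filling step can be sketched as follows. The template text, the slot syntax ([Class.property]) and the entity names are illustrative simplifications, not the templates actually generated from the ontology:

```python
def instantiate(template, entities):
    """Fill the slots of a query template with the named/temporal entities
    annotated in the user question, producing a concrete query string."""
    query = template
    for slot, value in entities.items():
        query = query.replace("[%s]" % slot, '"%s"' % value)
    return query

# Hypothetical SPARQL-like template for U1 (prefixes omitted).
template = ("SELECT ?name WHERE { ?h a :Hotel ; :name ?name ; "
            ":starRating [Hotel.starRating] ; "
            ":destination [Destination.name] . }")
entities = {"Hotel.starRating": "3", "Destination.name": "Birmingham"}
query = instantiate(template, entities)
```

The instantiated query would then be run against the RDF database (through Jena, in the project's setup) to retrieve the answers.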

3.3. Evaluation and results

The evaluation was based on the 200 user questions in the test set. Each user question was marked manually by assigning to it the machine-generated hypothesis questions entailed by it. If a user question had no entailed hypothesis question, it was marked as "null". A user question can entail a hypothesis question containing all of its constraints or only some of them. For example, the user question U1 entails H1, H2 and H3 to different extents. The extent of the entailment was weighted by dividing the number of constraints of the hypothesis question by the number of constraints of the user question. The maximum weight is 1.0, indicating that the hypothesis question exactly matches the user question.

U1: Please tell me the name of a <NE type="Hotel.starRating">3</NE> star hotel in <NE type="Destination.name">Birmingham</NE>.

H1: what is the name of the hotel which has the star rating [Hotel.starRating] and is in the destination [Destination.name]? (weight = 1.0)
H2: what is the name of the hotel which has a star rating [Hotel.starRating]? (weight = 0.5)
H3: what is the name of the hotel which is in the destination [Destination.name]? (weight = 0.5)
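The weighting scheme above can be expressed directly; in this minimal sketch the constraint sets of U1 and H1–H3 are represented simply as Python sets, an assumption for illustration:

```python
def entailment_weight(hypothesis_constraints, user_constraints):
    """Weight of an entailed hypothesis question: the number of its
    constraints divided by the number of constraints in the user question
    (1.0 = exact match, as in the evaluation described above)."""
    return len(hypothesis_constraints) / len(user_constraints)

u1 = {"Hotel.starRating", "Destination.name"}
h1 = {"Hotel.starRating", "Destination.name"}   # exact match
h2 = {"Hotel.starRating"}                       # partial match
entailment_weight(h1, u1)   # 1.0
entailment_weight(h2, u1)   # 0.5
```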

For each user question, the entailment engine calculated a confidence score against each of the hypothesis questions in the machine-generated set, and took the one with the highest score as the hypothesis question entailed by the user question. After various tests, it was determined that if the confidence score is below 0.09, the user question is deemed to have no entailed hypothesis question. The output of the entailment engines was compared against the human-generated gold standard, and the entailment accuracy was calculated by summing the weights of the found hypothesis questions and dividing by the number of user questions. It should be pointed out that the entailment accuracy of the gold standard for the 200 user questions is 95.85% rather than 100%: among these questions there are five T1 Site questions with more than two constraints and 16 T1D+S Occurrence questions, which have no exactly matching hypothesis questions but only partially matching ones with weights below 1.0. In this evaluation, four experiments were performed:

- Experiment 1: for each user question, look for the hypothesis question entailed by it in the whole machine-generated set;
- Experiment 2: for each user question, identify its topic first and look for the hypothesis question in the subset for its topic;
- Experiment 3: for each user question, identify its type first and look for the hypothesis question in the subset for its type;
- Experiment 4: for each user question, identify its topic and type first and look for the hypothesis question in the subset for its topic and type.
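The selection rule and the accuracy metric described above can be sketched as follows; the hypothesis identifiers and confidence scores are invented for illustration:

```python
def best_hypothesis(scores, threshold=0.09):
    """scores: {hypothesis_id: confidence} from the entailment engine.
    Return the highest-scoring hypothesis, or None ("null") when no score
    reaches the empirically chosen 0.09 threshold."""
    if not scores:
        return None
    hid, conf = max(scores.items(), key=lambda kv: kv[1])
    return hid if conf >= threshold else None

def entailment_accuracy(found_weights, n_questions):
    """Sum of the weights of the hypotheses found, divided by the number
    of user questions."""
    return sum(found_weights) / n_questions

best_hypothesis({"H1": 0.42, "H2": 0.10})     # "H1"
best_hypothesis({"H7": 0.03})                 # None ("null")
entailment_accuracy([1.0, 0.5, 0.0, 1.0], 4)  # 0.625
```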


The experiment results are reported in Table 4. In Experiments 2, 3 and 4, the entailment accuracy was calculated assuming that all the questions had been correctly classified by an ideal topic classifier (Experiment 2), an ideal type classifier (Experiment 3), or both (Experiment 4). The final QA accuracy was then calculated as the cumulative accuracy, taking into account the errors of the real topic and type classifiers.

According to the result of Experiment 1, the semantic entailment engine is able to find the correct hypothesis question among the 2703 machine-generated ones for most of the user questions (65.0%), whereas the syntactic engine can handle fewer than half of them (42.5%). Comparing Experiment 1 with Experiments 2 and 3, the entailment accuracy increases substantially for both engines. This indicates that correctly identifying the topic or type of a user question helps the entailment engine to discover the correct hypothesis question. In Experiment 2, even after accounting for the errors of the applied topic classifier (accuracy 95%), the final QA accuracy still shows a moderate increase for both engines (from 42.5% to 44.3%, and from 65.0% to 65.6%). In Experiment 3, however, because of the lower accuracy (85.5%) of the applied type classifier, the final QA accuracy increases moderately for the syntactic engine (from 42.5% to 46.4%) but decreases for the semantic engine (from 65.0% to 62.6%). Type classification is thus more helpful for the syntactic engine than for the semantic engine, because the semantic engine already exploits question type information to some extent by detecting questions' expected answer types.

Comparing Experiment 4 with Experiments 2 and 3, the entailment accuracy increases significantly for both engines. This indicates that correctly identifying both the topic and the type of the user question is more helpful than either classification alone. In Experiment 4, even after accounting for the errors of both applied classifiers, the syntactic engine still obtains its best QA accuracy (47.2%) among the four experiments, whereas the semantic engine's QA accuracy (60.9%) is worse than without any classification (65.0%). This means that question classification is very helpful for a weak entailment engine but less helpful for a powerful one.
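The cumulative QA accuracy described above can be reproduced with a short calculation: the final QA accuracy is the entailment accuracy measured with ideal classifiers, multiplied by the accuracies of the real classifiers actually applied (95% for topic, 85.5% for type, as stated in the text). A minimal sketch:

```python
# Final QA accuracy = ideal-classifier entailment accuracy multiplied
# by the accuracy of each real upstream classifier in the cascade.
def qa_accuracy(entailment_acc, *classifier_accs):
    acc = entailment_acc
    for c in classifier_accs:
        acc *= c
    return acc

topic_acc, type_acc = 0.95, 0.855  # real classifier accuracies from the text

# Experiment 2 (topic only), syntactic engine: 46.6% ideal entailment accuracy
print(round(qa_accuracy(46.6, topic_acc), 1))            # -> 44.3
# Experiment 4 (topic + type), semantic engine: 75.0%
print(round(qa_accuracy(75.0, topic_acc, type_acc), 1))  # -> 60.9
```

These values match the QA accuracies reported for Experiments 2 and 4 in Table 4.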

Page 24: Qall-Me – Pattern Acquisition for Question Answeringqallme.fbk.eu/Qallme-PatterAcquisition.pdfQall-Me – Pattern Acquisition for Question Answering Author: Qall-Me Consortium Affiliation:

FP6 IST-033860

Page 22 of 34

                                                     Entailment Acc. (%)      QA Acc. (%)
Experiment  Topic & Type    User Qs  Hypothesis Qs   Syntactic  Semantic   Syntactic  Semantic
1           All               200      2703            42.5       65.0       42.5       65.0
2           Site              122      2339            46.6       69.0       44.3       65.6
            Event              30       223
            Occurrence         48       141
3           T1                 97      2458            54.3       73.2       46.4       62.6
            T2                 85       245
            T3                 18         0
4           Site T1            57      2194            58.1       75.0       47.2       60.9
            Site T2            53       145
            Site T3            12         0
            Event T1            5       184
            Event T2           21        39
            Event T3            4         0
            Occurrence T1      35        80
            Occurrence T2      11        61
            Occurrence T3       2         0

Table 4: Results of the experiments (see footnote 9). Accuracies are reported once per experiment.

3.4.4 Results and discussion

The method presented in this section is very appropriate for ontology-based QA in restricted domains, because hypothesis questions and query templates can be produced directly from a domain ontology. Despite its advantages, the method also reveals a weakness: the produced hypothesis questions cannot cover all kinds of user questions. This becomes particularly problematic for a large domain ontology with many classes and properties, because it is difficult to exhaust the various combinations of classes and properties and to generate all possible hypothesis questions containing many constraints. In this study, we focused on the datatype properties within a distance of 2 of the main classes in the ontology, generating hypothesis questions with at most two or three constraints.
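The generation scheme above can be sketched as enumerating, for each main class, one target property plus a bounded set of constraining properties, and filling a question template for each combination. This is only an illustration of the idea: the class and property names below are invented, not the actual QALL-ME ontology vocabulary, and the real system generates questions from the OWL model rather than from a dictionary.

```python
# Sketch: generate hypothesis questions by combining a main class with
# its datatype properties (names are illustrative, not the real ontology).
from itertools import combinations

ontology = {
    "Cinema": ["name", "address", "phone"],     # datatype properties
    "Movie": ["title", "genre", "duration"],
}

def generate_hypotheses(ontology, max_constraints=2):
    questions = []
    for cls, props in ontology.items():
        for target in props:                     # property being asked for
            rest = [p for p in props if p != target]
            for n in range(max_constraints + 1):
                for constraint in combinations(rest, n):
                    q = f"What is the {target} of the {cls}"
                    q += "".join(f" with {c} [{c.upper()}]" for c in constraint)
                    questions.append(q + "?")
    return questions

qs = generate_hypotheses(ontology)
print(len(qs))  # -> 24 (2 classes x 3 targets x 4 constraint sets)
```

The combinatorial growth is visible even at this toy scale, which is exactly the coverage problem discussed above.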

The evaluation results revealed that the hypothesis questions produced from the domain ontology can be used to directly (T1 and T2) or indirectly (T3) answer the user questions to a great extent (95.85%), and that the developed textual entailment engine was able to find the entailed hypothesis questions for most of the user questions (65.6%). Ideally, the QA accuracy can be improved by using both topic and type classification. In practice, however, because of the errors of the simple type classifier, question type classification is not very helpful for a powerful entailment engine that already has some ability to identify question types, although it is still helpful for a weak engine. The boundary between powerful and weak engines depends on the accuracy of the applied classifiers. In addition, the generation of a correct query for answer retrieval depends not only on whether the correct hypothesis question, and hence the correct query template, is selected, but also on whether its slots are filled with the correct values (i.e. named entities). This means that the correct annotation of named entities is very important for answering a question correctly.

9 Note: The entailment accuracy of the gold standard (i.e. the entailed hypothesis questions that were manually identified by humans) is 95.85%. The entailment accuracy for T3 questions is 100% because they have been identified by an ideal type classifier and are thus deemed to have no hypothesis questions.

Engine                                Train set   Test set
Number of question pairs                667         667
EDITS (string level)                     48          49
EDITS (token level)                      67          67
EDITS+ONT (string level)                 48          58
EDITS+ONT (token level)                  75          80
Semantic engine                          78          84
Semantic engine + aligned WordNet        76          84

Table 5: Accuracy (%) of entailment recognition using EDITS and EDITS+ONT, in comparison to our semantic engines (filtering threshold = 20%).

Engine (667 pairs)                    Before cleaning   After cleaning
EDITS+ONT (token level)                     75                76
Semantic engine                             78                79
Semantic engine + aligned WordNet           76                77

Table 6: Effect of data cleaning on the accuracy (%) of the EDITS+ONT and semantic engines (filtering threshold = 20%).

Engine (427 x 2703 question pairs)    Accuracy (%)
EDITS+ONT (token level)                  35.40
Semantic engine                          51.89

Table 7: Accuracy of the EDITS+ONT engine on question pairs with predictive hypotheses (filtering threshold = 9%).

As shown in Table 5, EDITS and EDITS+ONT generate better scores at the token level than at the string level, for both the training and test sets. Comparing EDITS and EDITS+ONT at the token level, it is clear that incorporating ontology knowledge in the second engine improves the scores: the entailment accuracy increases from 67% to 75% on the training set and from 67% to 80% on the test set. However, comparing the EDITS+ONT engine to our semantic engine shows a further improvement of 3 and 4 points on the train and test sets respectively when the latter is used. A comparison of EDITS+ONT and our semantic engine on individual entailment decisions, especially for the NO/NO pairs, shows that the difference is not significant (87% and 88% respectively), which is mainly due to the filters that exploit the domain ontology to detect negative entailment decisions.
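For intuition, a token-level entailment criterion in the spirit of EDITS can be sketched as a normalised edit distance between the two token sequences, with entailment assigned when the distance falls below a threshold. This is a simplification for illustration only, not the actual EDITS cost scheme or its trained edit costs:

```python
# Sketch of a token-level edit-distance entailment criterion (in the
# spirit of EDITS, not its actual implementation): normalised Levenshtein
# distance over tokens below a threshold means "entailed".
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance over token lists
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[len(a)][len(b)]

def entails(text, hypothesis, threshold=0.5):
    t, h = text.lower().split(), hypothesis.lower().split()
    return edit_distance(t, h) / max(len(t), len(h)) <= threshold

print(entails("which films are on tonight at the Odeon",
              "which films are on at the Odeon"))  # -> True
```

Operating at the token level, as here, is what lets the engine tolerate insertions and substitutions of whole words, which explains the gap over string-level matching reported in Table 5.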

In order to overcome the noise in questions, which is mainly due to the grammar of spoken questions (e.g. repetition of some words, rephrasing within the same question, misspelled words, etc.), some data sets from our benchmark were cleaned by a human annotator and then evaluated. As shown in Table 6, cleaning the data slightly increased the scores for the EDITS+ONT engine, as well as for our semantic engines, by only one point.
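One of the disfluency phenomena mentioned above, immediate word repetition, can be removed mechanically; a naive sketch of such a cleaning step follows (the annotators' manual cleaning was of course richer, also handling rephrasings and spelling):

```python
# Naive sketch of one spoken-language cleaning step: collapse immediate
# word repetitions ("which which cinema ...") left by disfluencies.
def collapse_repetitions(question):
    cleaned = []
    for token in question.split():
        if not cleaned or token.lower() != cleaned[-1].lower():
            cleaned.append(token)
    return " ".join(cleaned)

print(collapse_repetitions("which which cinema is is showing Avatar"))
# -> "which cinema is showing Avatar"
```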

Our last evaluation of the EDITS+ONT engine was based on predictive questions, where 2703 hypothesis questions were automatically generated from our ontology (Ou et al., 2009). The generated questions covered three classes: Event, Site and Occurrence. The Event class refers to static information about events, e.g. Movie, Show. The Site class corresponds to places providing a kind of service, e.g. Accommodation, Attraction. Finally, the Occurrence class refers to an occurrence of an event in a site and/or at a date/time, e.g. a movie show time. 427 distinct user questions were selected from our benchmark to be entailed against the generated hypothesis patterns. The evaluation on this data set is presented in Table 7. As shown, our semantic engine still generates better results, with an accuracy of 51.89%, whilst the EDITS+ONT accuracy was 35.40%. To conclude, the evaluation of entailment between questions on various data sets has shown that the EDITS+ONT engine performs worse than our semantic engines. Nevertheless, the effectiveness of this new engine might be improved in the future if more advanced token features are considered (e.g. morpho-syntactic features, dependency trees, etc.).

4. Acquisition of patterns from the Web

This section reports on the work carried out at the University of Alicante during the third year of the project. The focus of this activity was the development and evaluation of techniques to speed up the process of developing a QALL-ME style QA system and to reduce its development and maintenance costs. To this aim, our work has focused on detecting system bottlenecks and reducing human intervention in the process as much as possible. The following sections explain in detail how we have reduced the time needed to develop such a system by an estimated 66%.

4.1. Designing a QALL-ME style system

The process of building a QALL-ME style QA system is divided into three main phases. Figure 5 depicts this process:

1. Build an ontology model for the domain
2. Populate the ontology
3. Acquire knowledge from the domain


Figure 5: Building a QALL-ME style QA system

Build an ontology model for the domain. Ontology-based QA needs a formal representation of the information in the domain. The ontology contains the main relations and classes the system deals with in a specific domain. The QALL-ME project uses an OWL2 ontology designed for the tourism domain (described thoroughly in D4.1, D4.2 and D4.3).

Populate the ontology. The ontology has to be populated with information about the domain. This information can be obtained from public data sources, provided by external data suppliers, or acquired automatically from texts by applying automatic extraction techniques. The data is stored in RDF format according to the ontology model. This RDF data is used as a structured database from which answers are extracted by means of SPARQL queries.

Acquire knowledge from the domain. Serving user information needs correctly depends directly on the system's ability to understand user queries and transform each input into a specific SPARQL query. For this purpose the system needs to gather enough domain knowledge to understand the different queries users can pose in the domain. In a QALL-ME style system all this information is collected by analysing a set of sample questions obtained from real users. The process of building the domain knowledge database involves two steps.

The first is to generate a set of representative queries according to the domain ontology. To do this, a representative group of users of different ages, genders and nationalities is selected to produce queries for the domain. Usually, the ontology, together with a list of real entities extracted from the domain data (e.g. cinema names, movie names, ...), is shown to them so that they generate queries asking for any kind of data they are interested in. These queries contain real data instances for the concepts defined in the ontology (for example, an instance of "CINEMA" is "Abaco 3D"). This process permits compiling real free-form queries for the selected domain. It is worth noting that this question set has to be large enough both to include the different expressions users would use to ask the system for information and to cover all the classes and relations appearing in the ontology.

The second step consists of manually processing these questions to detect and extract the domain knowledge needed to interpret user questions. The expected answer type, named entities and relations appearing in this set of queries are tagged, and all these


elements are associated with entities and relations in the domain ontology. This association enables correct question analysis and translation into the SPARQL query the system employs to access the required data. The system is then prepared to process and answer user questions in the selected domain.
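The end-to-end idea, tagged question elements mapped onto a structured query over the RDF data, can be sketched in miniature. Here plain Python tuples stand in for the RDF store and a filter function stands in for SPARQL; the predicate names and the sample address are illustrative, not taken from the project data:

```python
# Sketch of the final step: a tagged question (expected answer type plus
# named entities) is turned into a query over triples. Tuples stand in
# for the RDF store and the filter for SPARQL; names are illustrative.
triples = [
    ("Abaco 3D", "rdf:type", "CINEMA"),
    ("Abaco 3D", "hasAddress", "Avenida de Aguilera 1, Alicante"),
    ("Abaco 3D", "shows", "Avatar"),
]

def query(subject=None, predicate=None, obj=None):
    # None acts as a wildcard, like an unbound SPARQL variable
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# "What is the address of Abaco 3D?" -> EAT = ADDRESS, entity = CINEMA
answers = [o for _, _, o in query("Abaco 3D", "hasAddress")]
print(answers)  # -> ['Avenida de Aguilera 1, Alicante']
```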

4.2. System development bottlenecks

If we want to reduce the time needed to build a QALL-ME style system in a specific domain, we need to detect the most time-consuming processes and apply new strategies or methodologies to shorten the system development cycle.

Build an ontology model for the domain. Reducing the time needed for ontology design looks really difficult. From the initial analysis of the domain to the final design of the ontology, all the activities involve a large amount of manual work, which can only be alleviated by reusing or adapting an existing ontology. In any case, we do not consider this step a good candidate for time reduction, as the ontology is the core part of the system and its performance depends on good domain modelling.

Populate the ontology. The possibility of reducing the effort in this step is nonexistent, since the way the ontology is populated depends on the type of information sources available for the domain.

Acquire knowledge from the domain. This process is the most problematic and time-consuming of all. First, collecting a representative and useful set of user queries is not easy: it requires time and highly motivated people. Second, tagging the collected question set requires substantial manual effort, and this effort increases as the ontology grows. Finally, system performance is intimately related to the size and domain coverage of the question sample set obtained from users. We have to ensure this set is large enough to take advantage of machine learning techniques for building effective question interpretation tools (e.g. expected answer type taggers). Furthermore, the set has to cover all the possible classes and relations in the ontology for the system to be able to deal with all the questions users may pose.

After this analysis, the domain knowledge acquisition phase emerges as the perfect candidate for shortening the QALL-ME development cycle. Firstly, it is by far the most time-consuming task: we have estimated that it takes around 75% of the whole system development time. Secondly, all the manual processes involved could be substituted by automatic knowledge acquisition techniques. This way we could avoid the need for users and sample questions, as well as all the manual processes involved in the analysis of the question set.

4.3. Automatic domain knowledge acquisition

Automatically collecting the domain knowledge needed to build a QALL-ME style QA system will notably reduce system development and maintenance costs, since this is the most time-consuming part of the QALL-ME approach. During the third year of the QALL-ME project, we explored the possibility of acquiring the necessary domain knowledge automatically, so as to avoid the need for sample questions. With this purpose in mind we developed the following tools:

1. An automatic pattern acquisition algorithm (DC2)
2. A hierarchical, language- and domain-portable EAT tagger
3. A domain-adaptable Textual Entailment engine

An automatic pattern acquisition algorithm. DC2 automatically builds semantic lexicons that characterize classes and relations in free text by acquiring and weighting the textual patterns appearing in the context of the target classes or relations. For instance, the method allows us to characterize the relation Actor-Movie by finding textual occurrences that often refer to a semantic connection between instances of the entities Actor and Movie. These lists of semantically related words are very useful for identifying classes and relations both in questions and in answers, helping natural language applications to identify such relations in texts. To start the modeling, DC2 needs a small set of seeds containing instances of the entities we want to model. In the context of the QALL-ME project, the DC2 method has been used to automatically obtain the most meaningful semantically related words that can be associated with the classes and relations appearing in the QALL-ME ontology. Given the QALL-ME ontology and a small sample of data, DC2 associates each class and relation in the ontology with its respective set of textual patterns. These relations, along with their associated patterns, make up the domain knowledge the QALL-ME system needs. This process avoids the need for users and guarantees that all entities and relations in the ontology are covered. Figure 6 shows a visual example of the automatic pattern acquisition process. A more detailed explanation of this algorithm can be found in project deliverable D3.3.
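The core intuition of seed-driven pattern acquisition can be sketched as follows: given seed instance pairs (e.g. Actor-Movie pairs), find the textual contexts that connect them in a corpus and weight each context by how often it recurs. This is an illustration of the general idea only, not the actual DC2 algorithm, and the seeds and corpus below are invented:

```python
# Sketch of the idea behind seed-driven pattern acquisition (not the
# actual DC2 algorithm): collect the contexts linking seed instance
# pairs in a tiny corpus and weight each pattern by its frequency.
from collections import Counter
import re

seeds = [("Sam Worthington", "Avatar"), ("Brad Pitt", "Troy")]
corpus = [
    "Sam Worthington stars in Avatar alongside Zoe Saldana.",
    "Brad Pitt stars in Troy as Achilles.",
    "Brad Pitt appears in Troy.",
]

def acquire_patterns(seeds, corpus):
    patterns = Counter()
    for actor, movie in seeds:
        for sentence in corpus:
            m = re.search(re.escape(actor) + r"\s+(.+?)\s+" + re.escape(movie),
                          sentence)
            if m:
                patterns[m.group(1)] += 1  # text between the two instances
    return patterns

print(acquire_patterns(seeds, corpus))
# "stars in" is acquired twice, "appears in" once
```

The acquired, weighted patterns then serve as the lexicon that characterizes the Actor-Movie relation in new text.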


Figure 6: DC2-driven automatic pattern acquisition process

A hierarchical, language- and domain-portable EAT tagger. We have applied the resources obtained by the DC2 algorithm to the question classification task, obtaining very accurate results for both English and Spanish. A full description of this tool can be found in project deliverable D3.3.

A domain-adaptable Textual Entailment engine. All the manual entries to our Textual Entailment engine have been replaced by the knowledge automatically acquired by DC2 for the domain. This makes the process of tuning the TE engine to the required domain fully automatic.
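How DC2-style lexicons can drive expected answer type (EAT) classification can be sketched very simply: predict the class whose word list overlaps the question most. The word lists below are invented for illustration; they are not DC2 output and the real tagger is hierarchical and learned:

```python
# Sketch of lexicon-driven EAT classification: the predicted expected
# answer type is the class whose DC2-style word list overlaps the
# question most (lexicons here are illustrative, not DC2 output).
lexicons = {
    "CINEMA": {"cinema", "screen", "theatre", "showing"},
    "MOVIE": {"movie", "film", "actor", "director", "plays"},
    "TIME": {"time", "when", "start", "hour"},
}

def classify_eat(question):
    tokens = set(question.lower().split())
    return max(lexicons, key=lambda cls: len(lexicons[cls] & tokens))

print(classify_eat("when does the film start"))  # prints TIME
```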

4.4. A domain-adaptable Textual Entailment engine

In the second cycle of the project, the University of Alicante proposed an entailment-based QA core that established lexico-semantic inferences between a new input query and the predefined set of query patterns. This approach follows the textual entailment methodology (Glickman, 2005), since it treats the implication as a one-way meaning relation between two queries, in which the meaning of one must be inferred from the meaning of the other. This approach was already tested in deliverables D5.2 and D9.3. For the third cycle, however, the aim of the University of Alicante has been to measure


the impact of the manual steps in the construction of the repository of query patterns, which is the base of all entailment inferences.

The main characteristics of the entailment engine are the computation of several lexical implications and shallow semantic analyses focused on the knowledge provided by the ontology. Throughout its development, we endeavored to use as few external resources as possible, because the approach is going to be released as a web application. The lexical measures are applied over the lemmata belonging to the two snippets (without considering stop-words), and each measure produces a score reflecting the degree of similarity between the target snippets. These lexical measures comprise binary matching, the Smith-Waterman algorithm (Smith and Waterman, 1981), the Jaro distance (Jaro, 1995), matching between consecutive subsequences, the Euclidean distance, the Jaccard coefficient (Jaccard, 1912) and Wh-term matching. Regarding the inferences based on ontology knowledge, the entailment engine implements: (i) an ontology concept constraint, which fires when the same ontology concepts appear in both the input query and the candidate entailment query pattern; and (ii) an ontology attribute-based inference, which, supported by the different ways users asked for the same ontology attribute in the benchmark, is able to give the same relevance to different textual expressions referring to the same ontology attribute (see footnote 10).

Therefore, although the same entailment engine is being used on the Spanish side of the project, during the third cycle our principal concern has been to avoid any manual step in the construction of the repository of query patterns and to build it automatically. In this way, we can evaluate the behavior of our entailment engine when using an automatically built pattern repository instead of the former repository extracted from the QALL-ME benchmark queries. Note that the efficiency of the entailment component relies both on the inferences performed and on the quality of the query pattern repository. Consequently, since the patterns within the repository are also associated with the SPARQL query capable of retrieving the answer(s) from the database, correctly entailing the appropriate patterns results in retrieving the proper answer as well.

The advances made during the third cycle of the project aimed to substitute any kind of manually obtained domain knowledge with automatically collected knowledge only. The DC2 method has been used to obtain the most meaningful semantically related words, which have been associated with the classes and relations appearing in the QALL-ME ontology. The impact of these changes has been evaluated by embedding the TE engine in a whole system.
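As an example of the lexical measures listed above, the Jaccard coefficient compares the content-word sets of the two snippets. In this sketch, lowercased tokens stand in for the lemmata used by the engine, and the stop-word list is a small illustrative one:

```python
# One of the lexical measures listed above: the Jaccard coefficient over
# content tokens (lowercased tokens stand in for lemmata; the stop-word
# list is illustrative).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "at", "what", "which"}

def jaccard(q1, q2):
    s1 = set(q1.lower().split()) - STOPWORDS
    s2 = set(q2.lower().split()) - STOPWORDS
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

print(jaccard("What movies are showing at the Abaco cinema",
              "Which movies is the Abaco cinema showing"))  # -> 1.0
```

Each such measure yields one similarity score; the engine combines them, together with the ontology-based inferences, into its entailment decision.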

4.5. Comparative evaluation

In order to measure the impact of the automatic domain knowledge acquisition tools, we performed an evaluation in which the original QALL-ME system (with manually acquired knowledge) was compared to a new prototype where all the domain knowledge was collected automatically. This evaluation was only run for Spanish. We considered a subset of the QALL-ME tourism ontology, including concepts and relations of the Cinema sub-domain. The RDF answer database contained cinema and movie information supplied by local data providers.

We decided to perform an on-field evaluation. For this purpose we collected 162 questions from real users totally unacquainted with the project (e.g. high school students and administrative assistants, amongst others). They were given no specific information about system capabilities; they were only asked to pose queries requesting any information about movies or cinemas in Alicante. Table 8 shows the results of this test. The accuracy represents the queries that were answered well, satisfying the users' requirements (see footnote 11). As we can observe, the original QALL-ME system for Spanish was able to answer 80.35% of the questions correctly, while the fully automatic system reached 72.22%.

10 Further details about the entailment engine inferences developed by the University of Alicante can be found in the QALL-ME_D5.2_20081115 document.

System configuration      Accuracy
Original manual system     80.35%
Automatic system           72.22%

Table 8: Comparative evaluation results

As shown in the table, when automatic pattern acquisition is performed, the global system accuracy decreases by eight points. This is not bad news: the performance of the automatic system reached a level comparable to that of the manual one. These results are even more valuable considering that the time needed to obtain the required domain knowledge was only about 10% of the time employed to build it manually. There are nevertheless several types of relations that DC2 failed to model correctly. For instance, relations such as Fax-Cinema are poorly modeled because such information usually appears in tables or in structured form, so it was difficult to extract textual relations between these entities. Subsequent research should explore ways to solve these situations.

11 Well-answered queries are those that the users considered correct, since these answers provided the information requested.


5. References

Eberhart, R. C., Shi, Y., and Kennedy, J. (2001). Swarm Intelligence. Morgan Kaufmann.

Glickman, O. (2005). Applied Textual Entailment. Ph.D. thesis, Bar Ilan University.

Gretter, R., Kouylekov, M., and Negri, M. (2008). Dealing with Spoken Requests in a Multilingual Question Answering System. In Proceedings of AIMSA 2008, September 4-6, Varna, Bulgaria.

Jaccard, P. (1912). The distribution of the flora in the Alpine Zone. New Phytologist, 11(2), 37–50.

Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14, 491–498.

Mehdad, Y., Negri, M., Kouylekov, M., and Cabrio, E. (2009). EDITS: An Open Source Framework for Recognizing Textual Entailment. To appear in Proceedings of the Text Analysis Conference (TAC 2009) Workshop, November 16-17.

Mehdad, Y. (2009). Automatic Cost Estimation for Tree Edit Distance Using Particle Swarm Optimization. In Proceedings of ACL-IJCNLP 2009, August 2-7, Singapore.

Negri, M., Kouylekov, M., Magnini, B., Cabrio, E., and Mehdad, Y. (2009). Towards Extensible Textual Entailment Engines: the EDITS Package. To appear in Proceedings of AI*IA 2009, December 9-12, Reggio Emilia, Italy.

Negri, M., and Kouylekov, M. (2009). Question Answering over Structured Data: an Entailment-Based Approach to Question Analysis. In Proceedings of RANLP 2009, September 14-16, Borovets, Bulgaria.

Ou, S., Mekhaldi, D., and Orasan, C. (2009). An Ontology-based Question Answering Method with the Use of Textual Entailment. To appear in Proceedings of NLPKE 2009, the IEEE International Conference on Natural Language Processing and Knowledge Engineering, September 24-27, Dalian, China.

Smith, T. F., and Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147, 195–197.

Wang, R., and Neumann, G. (2007). Recognizing Textual Entailment Using a Subsequence Kernel Method. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence (AAAI-07), July 22-26, Vancouver, Canada.

Wang, R., and Neumann, G. (2008). Information Synthesis for Answer Validation. In Peters, C. et al. (eds.): CLEF 2008 Working Notes, Aarhus, Denmark. Springer Verlag.

Wang, R., and Neumann, G. (2009). An Accuracy-Oriented Divide-and-Conquer Strategy for Recognizing Textual Entailment. In Proceedings of the Text Analysis Conference (TAC 2008) Workshop, RTE-4 Track, Gaithersburg, Maryland, USA.

Wang, R., and Zhang, Y. (2009). Recognizing Textual Relatedness with Predicate-Argument Structures. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore.

Versley, Y., Ponzetto, S. P., Poesio, M., Eidelman, V., Jern, A., Smith, J., Yang, X., and Moschitti, A. (2008). BART: A Modular Toolkit for Coreference Resolution. ACL 2008 system demonstration.