

Information Systems 34 (2009) 753–765


Exploring models for semantic category verification

Dmitri Roussinov (a), Ozgur Turetken (b)

(a) Department of Computer and Information Sciences, University of Strathclyde, L13.29 Livingstone Tower, 16 Richmond Street, Glasgow G1 1XQ, United Kingdom
(b) Institute of Innovation and Technology Management, Ted Rogers School of Information Technology Management, Ryerson University, 575 Bay Street, Toronto, Ont., Canada M5G 2C5

Article info

Keywords:

Semantic category verification

Automated question answering

Text mining

World wide web

Search engines


Abstract

Many artificial intelligence tasks, such as automated question answering, reasoning, or heterogeneous database integration, involve verification of a semantic category (e.g. "coffee" is a drink, "red" is a color, while "steak" is not a drink and "big" is not a color). In this research, we explore completely automated, on-the-fly verification of membership in an arbitrary category that has not been anticipated a priori. Our approach does not rely on any manually codified knowledge (such as WordNet or Wikipedia) but instead capitalizes on the diversity of topics and word usage on the World Wide Web, and can thus be considered "knowledge-light" and complementary to "knowledge-intensive" approaches. We have created a quantitative verification model and established (1) which specific variables are important and (2) what ranges and upper limits of accuracy are attainable. While our semantic verification algorithm is entirely self-contained (not involving any previously reported components that are beyond the scope of this paper), we have tested it empirically within our fact seeking engine on the well-known TREC conference test questions. Due to our implementation of semantic verification, the answer accuracy has improved by up to 16%, depending on the specific models and metrics used.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Semantic verification is the task of automated verification of membership in an arbitrary (not pre-anticipated) category, e.g. red is a color, coffee is a drink, but red is not a drink. While such problems arise in many domains, here we specifically explore its applications to online fact seeking, which is sometimes referred to as open-corpus (open-domain) question answering. Our approach builds on massive pattern matching, which we believe models the human linguistic practice of digesting evidence for categorical membership over a lifetime of learning.


The goal of question answering is to locate, extract, and represent a specific answer to a user question expressed in natural language. Answers to many natural language questions (e.g. What color is the sky?) are expected to belong to a certain semantic category (e.g. a color such as blue, red, purple, etc.). Those questions prove to be relatively difficult for current systems, since the correct answer is not guaranteed to be found in an explicit form such as in the sentence The color of the sky is blue, but rather may need to be extracted from a sentence answering it implicitly, such as I saw a vast blue sky above me, in which a wrong answer "vast" has grammatically the same role as the correct answer "blue" and represents a property of the sky. However, vast refers to size, while we are looking for a color.

The currently popular approach to solving this "semantic" matching problem is through developing an extensive taxonomy of possible semantic categories [20]. This requires the anticipation of all possible questions, and hence substantial manual effort. Moreover, this approach poses significant limitations, and although it works relatively well with more common categories (cities, countries, organizations, writers, musicians), it does not handle at least the following types of categories: (1) Rare categories: e.g. What was the name of the first Russian astronaut to do a spacewalk? What American revolutionary general turned over West Point to the British? (2) Categories involving logic: e.g. the question What cities in Eastern Germany have been bombed during World War II? involves a category defined as the logical conjunction of being a city and being located in Eastern Germany. (3) Vague categories: e.g. the question What industry is Rohm and Haas in? involves the category industry, for which a simple Google search lists several definitions, including "broad term for economic activity", a "sector", "people or companies engaged in a particular kind of commercial enterprise", and "businesses concerned with goods as opposed to services."

In this paper, we explore completely automated on-the-fly verification of membership in a previously unanticipated category. Although our algorithm can be used inside any other system, we have implemented and empirically evaluated it within our fact seeking engine, which has been available in a demo version online [42]. Our inspection of the 1000+ search sessions recorded by our demo reveals that approximately 20% of the questions processed by the system have answers that are expected to belong to a specific semantic category; thus such systems can certainly benefit from semantic verification. The performance of our system was evaluated earlier [35,36] and found to be comparable with that of other state-of-the-art systems (e.g. [14,22]) that are based on redundancy rather than on extensive manually codified knowledge such as elaborate ontologies or rules for deep parsing. Contrary to the "knowledge-heavy" commercial systems, our system is entirely transparent: all the involved algorithms are described in prior publications and thus can be replicated by other researchers, which we believe makes this work superior to that reported on "closed" (impossible to replicate) systems.

Through the work reported in this paper, we improve the semantic verification component of our system by moving beyond pure heuristics and by building a model based on logistic regression. Our hypotheses focus on (1) what variables contribute to the accuracy of answers, (2) what normalizing transformations are beneficial, and (3) whether the improvements due to category verification are statistically and practically significant.

2. Literature review

The problems of automated verification of membership in an arbitrary (not pre-anticipated) category exist in many domains, including (1) Automated question answering: for example, the correct answer to the question What soft drink has the most caffeine? should belong to the category "soft drink". (2) Database federation, where the automated integration of several heterogeneous databases requires matching an attribute in one database (e.g. having such values as red, green, and purple) to an attribute (e.g. color) in another database. (3) Automated reasoning, where rules are propagated to all the subclasses of a superclass. (4) Spellchecking or oddity detection [17], where the substitution of a word with its hypernym (superclass) or hyponym (subclass) is considered legitimate while many other types of substitutions are not.

2.1. QA technology

The National Institute of Standards and Technology (NIST) has been organizing the annual Text Retrieval Conference (TREC) [39,40] since 1992, in which researchers and commercial companies compete in document retrieval and question answering tasks. The participating systems have to identify exact answers to so-called factual questions (or factoids) such as who, when, where, what, etc., list questions (What companies manufacture rod hockey games?), and definitions (What is bulimia?). In order to answer these questions, a typical participating system would: (a) transform the user query into a form it can use to search for relevant documents (web pages), (b) identify the relevant passages within the retrieved documents that may provide the answer to the question, and (c) identify the most promising candidate answers from the relevant passages. Most of the systems are designed based on techniques from natural language processing, information retrieval, and computational linguistics. For example, Falcon [20], one of the most successful systems, is based on a pre-built hierarchy of dozens of semantic types of expected answers (person, place, profession, date, etc.), complete syntactic parsing of all potential answer sources, and automated theorem proving to validate the answers.

In contrast to the natural language processing-based approaches, "shallow" approaches that use only simple pattern matching have recently been tried with a good level of success. For example, the system from InsightSoft [38] won first place in the 2002 and second place in the 2001 TREC competitions. The "knowledge-light" systems based on simple pattern matching and redundancy (repetitions of the answer on the Web), such as [14], also scored comparably.

Both NLP-based approaches and those that require elaborate manually created patterns have a strong advantage: they can be applied to smaller collections (e.g. corporate repositories) and still provide good performance. However, none of the known top performing systems has been made publicly open to other researchers for follow-up investigations, because of the expensive knowledge engineering required to build such systems and the related intellectual property issues. As a result, it is still not known what components of these systems are crucial for their success, and how well their approaches would perform outside of the TREC test sets.

Meanwhile, the algorithms behind some of the systems that do not require extensive knowledge engineering, but still demonstrate reasonable performance, have been made freely available to the public. Therefore, replication of these systems and independent testing by other researchers is possible. We believe that from a research perspective, these non-commercial systems and the approaches behind them are by no means less interesting than the top commercial systems. Some of those systems are reviewed in the next section.

2.2. Web QA

Prior studies [31] have indicated that the current WWW search engines, especially those with very large indexes like Google, offer a very promising source for open domain question answering. Roussinov et al. [36] presented a recent review of the trends and capabilities of modern Web question answering systems and concluded that they give more fact seeking power to the user than the popular keyword driven search portals like Google and MSN. There are several important distinctions between automated question answering in a closed corpus and that within the entire Web:

(1) Typically, the Web has a much larger variety in the answers that it presents. This allows Web-based fact seeking systems to look for answers in the simplest forms (e.g. Sky is blue), which generally makes the task easier.

(2) The users of Web fact seeking engines do not necessarily need the answers to be extracted precisely. In fact, we personally observed from interaction with practitioners that they prefer to read the answer within its context (e.g. snippets displayed by keyword search engines such as Google) to verify that it is not spurious.

(3) Web fact seeking engines need to be quick, while the TREC competition does not force systems to provide answers in real time. This makes it important for Web-based systems to emphasize simple and computationally efficient algorithms and implementations, such as simple pattern matching as opposed to "deep" linguistic analysis.

Several systems that are based on these general principles have been developed since the early 1990s. START [25] was one of the first systems, available online since 1993. It was primarily focused on encyclopedic questions (e.g. about geography) and used a precompiled knowledge base. Prior evaluation of START [25] indicated that its knowledge is rather limited, e.g., it fails at many questions from the standard test sets (detailed below).

Mulder [26] was the first general-purpose, fully automated system available on the web. It worked by querying a general purpose search engine (Google or MSN), then retrieving and analyzing the web pages returned from the queries to select answers. When evaluated by its designers on TREC questions, Mulder outperformed AskJeeves and Google by a large margin. Unfortunately, Mulder is no longer available on the Web as a demo or for download for a comparative evaluation.

In their prototype called Tritus, Agichtein et al. [1] introduced a method for learning query transformations that improves the ability to use web search engines to answer questions. Blind evaluation of this system on a set of real queries from a web search engine log showed that the method significantly outperformed the underlying web search engines as well as a commercial search engine specializing in question answering. Tritus is also not available online.

A relatively complete, general-purpose, web-based system, called NSIR, was presented in [32]. Dumais et al. [14] presented another open-domain Web system that applies simple combinatorial permutations of words (so-called "re-writes") to the snippets returned by Google and a set of 15 handcrafted rules (semantic filters). Their system achieved a remarkable accuracy level on the TREC test set: a Mean Reciprocal Rank (MRR) of 0.507, which can be roughly interpreted as the correct answers being "on average" the second answers found by the system. This was only 20–30% below the accuracy of the best knowledge-heavy systems and still within the top 20% of participating systems. The authors' experiments also indicated that semantic filtering was the component that had the greatest impact on the overall performance of the system, even though it was limited to only a few categories (people, places, dates, numbers, etc.). Similar observations about the importance of the semantic verification of the candidate answers have been reported based on the best known knowledge-heavy systems: Moldovan et al. [29] and Harabagiu et al. [19] used a combination of heuristics, ontological knowledge from WordNet and machine learning to distinguish 38 pre-anticipated possible semantic categories. They also noted that the category frequently happens to be identifiable by common Named Entity (NE) recognizers.

2.3. Data driven semantic verification

To avoid the laborious creation of large knowledge bases (e.g. ontologies), researchers have been actively trying automated or semi-automated data driven semantic verification techniques. The idea of counting the number of matches to certain simple patterns (e.g. colors such as red, blue or green; soft drinks including Pepsi and Sprite, etc.) in order to automatically discover hyponym relationships is typically attributed to Hearst [21] and was tested on Grolier's American Academic Encyclopedia using WordNet as the gold standard. Variations of the idea of Hearst's patterns have been adopted by other researchers: in specific domains [3], for anaphora resolution [30], and for the discovery of part-of [8] and causation [18] relations.

These approaches are known for a relatively high (50%+) precision, but a very low recall, due to the fact that occurrences of the patterns in a closed corpus are typically rare. To overcome this data sparseness problem, researchers resorted to the World Wide Web: Hearst patterns are searched for using the Google API for the purpose of anaphora resolution in [28] and for enriching a given ontology in [2]. Earlier work by S. Brin (one of the founders of the Google corporation) also followed along those lines. In [7], he presented a bootstrapping approach in which the system starts with a few patterns, and then tries to induce new patterns using the results of the application of the seed patterns as a training dataset. This was also the general idea underlying the Armadillo system [11], which exploited redundancy in the World Wide Web to induce such extraction rules.

Cimiano et al. [9,10] used the Google API to match Hearst-like patterns on the Web in order to find the best concept for an unknown instance, and for finding the appropriate superconcept for a certain concept in a given ontology. The SemTag system [12] automatically annotates web pages by disambiguating entities appearing in a web page while relying on the TAP lexicon to provide all the possible concepts, senses or meanings of a given entity. The systems participating in the Message Understanding Conferences (MUC) achieved accuracies of well above 90% in the task of tagging named entities with respect to their class labels, but the latter were limited to three classes: organization, person and location. The work by Alfonseca and Manandhar [4] also addressed the problem of assigning the correct ontological class to unknown words by building on the distributional hypothesis, i.e. that words are similar to the extent to which they share linguistic contexts. They adopted a vector-space model and exploited verb/object dependencies as features.

Somewhat similar to the methods and the purpose of the research described here are the approaches within the KnowItAll project [13,15], which automatically collects thousands of relationships, including the hyponymic ("is-a") ones, from the Web. Within KnowItAll, Etzioni et al. developed a probabilistic model by building on the classic balls-and-urns problem from combinatorics. They established that the urns model outperforms those based on the pointwise mutual information metrics used earlier within prior KnowItAll studies [13,15] or by other researchers. However, the model provides an estimate of the probability of categorical membership only in the case of supervised learning (anticipated and manually labeled categories). It provides only a rank-ordering in the unsupervised case (not a pre-anticipated category). The rank ordering was evaluated by recall and precision, and only on four relations: Corporations, Countries, CEO of a Company, and Capital of a Country, while for the purpose of our study, i.e. question answering, a much larger number of categories is typically encountered. Schlobach et al. [37] studied semantic verification for a larger number of categories, but these categories were limited to the geography domain. They also used knowledge intensive methods in addition to pattern matching statistics.

Fleischmann and Hovy [16] classified named entities into fine-grained categories. They used 8 classes for a person (athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police), examined 5 different machine learning algorithms (C4.5, a feed-forward neural network, k-nearest neighbors, a support vector machine and a Naïve Bayes classifier), and reported the best accuracy as 70.4%.

To summarize, while pattern-matching, data-driven approaches have been suggested and studied earlier for tasks related to semantic category verification, none of the prior methods provided a probabilistic verification of membership in a previously non-anticipated semantic category, which is necessary for question answering and is the focus of investigation here. Previous research has also not empirically studied which variables and transformations are important in the mathematical models.

3. Question answering technology involved in the study

This section briefly overviews the principles behind the question answering system that we used for the study reported in this paper. Although more details can be found in earlier publications [35,36], they are not essential for the replication of the study here. The full details of our semantic verification algorithm (the focus of this study) are presented in the next section.

The general idea behind the approach we took here (like those behind many others mentioned in Section 2) is to apply pattern matching and take advantage of the redundancy (repetition of the same information) on the web. For example, the answer to the question "What is the capital of Taiwan?" can be found in the sentence "The capital of Taiwan is Taipei", which matches the pattern \Q is \A, where \Q is the question part ("The capital of Taiwan") and \A = "Taipei" is the text that forms a candidate answer. To our knowledge, the trainable pattern approach was first introduced in Ravichandran and Hovy [33].

3.1. Pattern training

We automatically create and train up to 50 patterns for each question type (such as what is, what was, where is, how far, how many, how big, etc.), based on a training data set consisting of question–answer pairs, e.g. those available from past TREC conferences [39,40]. Through training, each pattern is assigned a probability estimate that the entire matching text contains the correct answer. This probability estimate is used in the triangulation (confirming/disconfirming) process that re-ranks candidate answers. \A, \Q, \T, \p (punctuation mark), \s (sentence beginning), \V (verb) and * (a wildcard that matches any words) are the only special symbols used in our pattern language so far, but it is easily extensible.
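For illustration, this pattern language can be interpreted roughly as in the following minimal sketch; the helper name and the regex details are our own reading, not the authors' implementation:

import re

# A sketch of interpreting the pattern language: \Q is replaced by the question
# part, \A captures the candidate-answer text, and * matches a single word.
def compile_pattern(pattern, question_part):
    regex = re.escape(pattern)                              # escape literal text first
    regex = regex.replace(re.escape(r"\Q"), re.escape(question_part))
    regex = regex.replace(re.escape(r"\A"), r"(?P<answer>[^.?!]+)")
    regex = regex.replace(re.escape("*"), r"\w+")
    return re.compile(regex, re.IGNORECASE)

rx = compile_pattern(r"\Q is \A", "The capital of Taiwan")
m = rx.search("The capital of Taiwan is Taipei.")
print(m.group("answer") if m else None)                     # -> Taipei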

Answering the question "In which city is Eiffel Tower located?" illustrates our question answering process step by step.

3.2. Type identification

The question itself matches the pattern in which \T is \Q \V, where \T = "city" is the semantic category of the expected answer, \Q = "Eiffel Tower" is the so-called question part (sometimes referred to as the "target" or "focus"), and \V = "located" is a verb. In order to correctly identify the question type and its components, the system employs the Brill tagger [5], a freely available part-of-speech tagger. We apply the tagger to the questions only, but not to the answer sources. Since in this study we only looked at questions that have a certain expected semantic category (and not one of the following: person, place, date, number) for an answer, the details of the type identification in the general case are not essential and are skipped here. The target semantic category was identified by applying 4 simple regular expressions to the questions. For example, the regular expression "(What|Which) (.+)(do|does|did|is|was|are|were)" would match the question "What tourist attractions are there in Paris?" and identify "tourist attraction" as the semantic category.
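As a concrete illustration, the category-extraction step can be sketched as follows; we use a lightly adjusted (non-greedy, word-bounded) variant of the expression quoted above, since the greedy form can over-capture:

import re

# A sketch of the category-extraction step, using a non-greedy, word-bounded
# variant of the regular expression quoted in the text.
CATEGORY_RE = re.compile(r"(What|Which) (.+?) (do|does|did|is|was|are|were)\b",
                         re.IGNORECASE)

def expected_category(question):
    m = CATEGORY_RE.search(question)
    return m.group(2).strip() if m else None

print(expected_category("What tourist attractions are there in Paris?"))
# -> "tourist attractions" (reducing the head noun to its singular form
#    gives "tourist attraction", as in the example above)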

3.3. Query modulation and answer matching

Query modulation converts each answer pattern (e.g., \Q is \V in \A) into a query for a general purpose search engine. For example, the Google query for the above-mentioned question would be +"Eiffel Tower is located in". The sentence "Eiffel Tower is located in the centre of Paris, the capital of France" would result in a match and create a candidate answer "the centre of Paris, the capital of France" with a corresponding probability estimate of containing a correct answer (i.e. 0.5) obtained previously for each pattern by training.
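A small sketch of this query construction (the function name and argument handling are ours, not part of the described system):

# A sketch of query modulation: substitute the question part and verb into an
# answer pattern and turn the fixed prefix into an exact-phrase query.
def modulate(pattern, question_part, verb):
    text = pattern.replace(r"\Q", question_part).replace(r"\V", verb)
    text = text.split(r"\A")[0].strip()      # keep only the text before the answer slot
    return '+"' + text + '"'

print(modulate(r"\Q is \V in \A", "Eiffel Tower", "located"))
# -> +"Eiffel Tower is located in"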

3.4. Answer detailing

Answer detailing produces more candidate answers by forming sub-phrases from the original candidate answers that: (1) do not exceed three words (not counting "stop words" such as a, the, in, on), (2) do not cross punctuation marks, and (3) do not start or end with stop words. In our example, these would be: centre, Paris, capital, France, centre of Paris, capital of France. Each detailed candidate answer is assigned a probability estimate of being correct based on its proportion (word length) in the matching text. This follows a simple assumption that the correct answer can occur anywhere with the same likelihood (uniform distribution) within the matching text.
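A sketch of these detailing rules on the example above (the stop-word list is an illustrative subset, not the one actually used by the system):

import re

STOP_WORDS = {"a", "an", "the", "in", "on", "of", "to"}     # illustrative subset

# Enumerate sub-phrases of a candidate answer that span at most three non-stop
# words, do not cross punctuation, and do not start or end with a stop word.
def detail(candidate):
    phrases = set()
    for segment in re.split(r"[,.;:!?]", candidate):        # (2) no punctuation crossing
        words = segment.split()
        for i in range(len(words)):
            for j in range(i, len(words)):
                chunk = words[i:j + 1]
                content = [w for w in chunk if w.lower() not in STOP_WORDS]
                if not content or len(content) > 3:          # (1) at most three words
                    continue
                if chunk[0].lower() in STOP_WORDS or chunk[-1].lower() in STOP_WORDS:
                    continue                                  # (3) no stop-word edges
                phrases.add(" ".join(chunk))
    return phrases

print(detail("the centre of Paris, the capital of France"))
# e.g. {'centre', 'Paris', 'capital', 'France', 'centre of Paris', 'capital of France'}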

If there are fewer candidate answers found than expected (e.g. 200), the algorithm resorts to a "fall back" approach as in [14]: it creates candidate answers from each sentence of the snippets returned by the search engine (Google) and applies answer detailing to those answers, while setting the probability estimate of containing the correct answer to 0.01 (manually set based on prior estimates). If there are still not enough candidates, the system automatically relaxes the modulated query by removing the most frequent words first until enough pages are returned by the engine. Thus, this method reduces a question such as "Who still makes rod hockey games?" to "Who makes rod hockey?". Prior work has shown that without involving any pattern matching and using solely the "fall back" approach, our system as well as other similar question answering systems suffer only a 10% relative drop in accuracy [14,35,36]. Therefore, we do not believe using answer patterns was essential in our study, and it can be safely skipped if our results are to be replicated while keeping the implementation complexity at a minimum. In that case, all the candidate answers can be easily obtained from the snippets returned by the underlying search engine (e.g. Google) in response to the question submitted verbatim.

3.5. Semantic adjustment

Our underlying question answering system does not use an extensive hierarchy of question types organized by the expected semantic category of an answer (e.g. person, celebrity, organization, city, state, etc.). Rather, it uses a small set of independently considered semantic boolean (yes/no) features of candidate answers: is number, is date, is year, is location, is person name and is proper name. Those features are detected by matching simple hand-crafted regular expressions. Depending on the expected semantic category of an answer, the presence or absence of those features results in applying any of approximately 20 manually tuned discounts. Knowing the entire set of rules and their discount factors is not essential for the study reported here, because we observed that semantic adjustment only effectively removed the candidate answers that are numbers, dates, locations or people's names, while leaving other candidate answers to be processed by the semantic verification, the focus of the current paper.

3.6. Triangulation

The candidate answers are triangulated (confirmed or disconfirmed) against each other and then re-ordered according to their final score. In essence, our triangulation algorithm promotes those candidate answers that are repeated the most. It eventually assigns a score sc in the [0,1] range, which can be informally interpreted as an estimate of the probability of the candidate answer being correct. These probabilities are independent in the sense that they do not necessarily add up to 1 across all the candidate answers, which we believe does not pose a problem since multiple correct answers are indeed often possible.

We used the "urns-and-balls" triangulation model in our study. It assumes that every pattern match (candidate answer) is an independent event, and thus the final probability (score) can be estimated as

sc = 1 − ∏_i (1 − p_i),   (1)

where the product goes over each different source i of the same candidate answer c, and p_i is its estimated probability of being correct based solely on that source; (1 − p_i) represents the estimated probability of being wrong. In order for this model (estimate) to be accurate, the sources have to be as independent as possible, which can be accomplished by various heuristics. In particular, we considered the same web pages and the same sentences longer than 10 words as duplicate sources and, thus, counted them only once in formula (1). It is easy to see that formula (1) produces a probability estimate in the [0,1] range and that the estimate quickly approaches 1 when multiple matches are found with substantial probabilities; e.g. ten matches, each with a 0.2 score, would result in a probability estimate of about 0.89 (1 − 0.8^10) that the candidate answer is correct. Please note that this score reflects only the grammatical evidence of being correct, which is collected through candidates matching specific answer patterns or by their occurrence in the vicinity of the question words ("fall back" mode). This "grammatical" evidence still needs to be combined with semantic verification (the probability estimate of belonging to the expected semantic category, e.g. color or city), which is the focus of our current work and is detailed in the next section.
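A direct reading of formula (1) as a tiny sketch (the function name is ours):

from math import prod

# Combine the per-source probability estimates of a candidate answer being
# correct, assuming the sources are independent (formula (1)).
def triangulate(source_probs):
    return 1.0 - prod(1.0 - p for p in source_probs)

print(round(triangulate([0.2] * 10), 2))   # ten 0.2-scored matches -> 0.89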

4. Automated semantic verification

4.1. The intuition behind our approach

The intuition behind our approach to semantic verification is as follows. If we need to answer the question What soft drink contains the largest amount of caffeine?, then we expect a candidate answer (e.g. Pepsi) to belong to a specific category (soft drink). This implies that one of the possible answers to the question What is Pepsi? should be soft drink. We can ask this question automatically, perform the same necessary steps as outlined above, and check a certain number of the top answers to see if they contain the desired category. Thus, in this paper, automated question answering on the web is not only the application of semantic verification, but it also serves as the technology behind it.

We deliberately simplified the question answering cycle inside our system when it is used for semantic verification, in order to keep the process quick and the algorithms easier to replicate. The semantic verification algorithm described here can be easily replicated within any other system by simply querying a commercial web search engine (e.g. Google) in the way explained below. Thus, it was not necessary for us to share software code in order to allow replication of the study reported here.

We selected a total of 16 of the most "efficient" patterns from those automatically identified and trained for the "what is" type of questions, using the following procedure. First, we selected the most potent patterns, those that provided the largest number of correct matches when querying the underlying search engine. (Using the most accurate patterns is not the most efficient solution, since some of those patterns rarely produce any matches.) In order to make the patterns more general and potentially produce more matches to analyze, for each pattern containing the article "a" we added the corresponding pattern with the article "an". Similarly, for each pattern based on the auxiliary verb "is" (present tense), we added the corresponding pattern with its past tense form "was". The entire set of patterns formed as described above is listed in Table 1.

Table 1. The patterns used for semantic verification.

\Q of \A                      \Q is a kind of a \A
\Q is a \A                    \Q is a type of a \A
\Q is an \A                   \Q is a kind of an \A
\Q is an example of a \A      \Q is a type of an \A
\Q was a \A                   \Q was a type of a \A
\Q was an \A                  \Q was an example of an \A
\Q was an example of a \A     \Q was a kind of an \A
\Q was a kind of a \A         \Q was a type of an \A

During semantic verification, instead of the full question answering process outlined above, for each candidate answer our system only queries the underlying web search engine for the total number of pages that match the modulated query for each pattern listed in Table 1. For example, "pepsi is a soft drink" matches 127 pages from Google; "pepsi is a type of a soft drink" matches 0, etc. We denote the corresponding numbers of matches below as m1, ..., m16, and the aggregate total number of matches is denoted as M.
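A sketch of this counting step; hit_count(phrase) is a hypothetical placeholder for whatever page-count query the underlying search engine exposes (the experiments described later used MSN.com counts), and only three of the 16 patterns of Table 1 are spelled out:

# Count pattern matches for one candidate answer and one expected category.
PATTERNS = [r"\Q is a \A", r"\Q is a kind of a \A", r"\Q is a type of a \A"]

def match_counts(candidate, category, hit_count):
    ms = [hit_count(p.replace(r"\Q", candidate).replace(r"\A", category))
          for p in PATTERNS]
    return ms, sum(ms)          # the individual m_i and their total M

# Example with a stub engine: ms, M = match_counts("Pepsi", "soft drink", hit_count=lambda q: 0)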

4.2. The role of semantic verification in the question answering process

The purpose of semantic verification is to estimate the probability that the given candidate answer belongs to the specified category. For example, since Pepsi is a soft drink, that probability estimate should be close to 1.0. In some less clear cases (e.g. mosquito is an animal) it can be less than 1.0. An informal interpretation of this probability estimate may be as follows: if we asked 100 people the question "Is a mosquito an animal?" and only 30 of them answered "yes", then that probability estimate would be 0.3. Since our approach draws on the aggregated opinions of all of the Web contributors, it may produce category memberships that are considered wrong according to formal taxonomies, but are still believed to be true due to common myths, misconceptions, or even jokes, which frequently circulate on the Web. For example, the phrase "coffee is a soft drink" can be found in 7 pages, although, formally, it is not true. That is why we believe that it is important to aggregate across multiple pattern matches and sources. We also believe it is important to consider the overall statistics of both the category and the candidate answer, for example, the total numbers of web pages that contain them. Obviously, more commonly occurring candidate answers have a higher number of pattern matches simply because of their frequent occurrence in general, and not necessarily because of a high likelihood of being a correct answer. Thus, discounting the number of pattern matches for frequent candidate answers should be present in an efficient model.

We combine all these variables into one model that estimates the probability of membership in a given semantic category. If we denote this probability estimate (score) as s, then each candidate answer's initial score (obtained after all the processing steps described in the previous section) is multiplicatively adjusted, as is typical for current state of the art question answering systems [39,40].
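Read literally, the adjustment combines the grammatical score sc of Section 3.6 with the membership estimate s; the exact form below is our reading of "multiplicatively adjusted" rather than a formula stated explicitly in the text:

sc_adjusted = sc × s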

4.3. Building the predictive model

Based on our prior experience with semantic verification and its application to automated question answering, we state the properties that are highly desirable for the quantitative model:

(1) Use of global statistics: The model should be capable of combining the individual numbers of specific pattern matches and the aggregate statistics of the candidate answers and the category itself.

(2) Weighting of patterns: Some patterns may be more reliable indicators of the semantic membership (e.g. "X is a type of Y") than others (e.g. "X of Y").

(3) Gracious handling of zero matches: Since some candidates are rare words or phrases, their matches to the verification patterns may not be reliably detected in any corpus of a limited size, even a very large one such as the World Wide Web.

(4) Uniformity: The same formulas and parameters are applied across all categories.

While noting that, to our knowledge, properties 1, 2, and 4 have been neither incorporated into nor combined with the models suggested by prior research, we have developed a number of predictive models using binary logistic regression techniques to estimate the probability of membership s. These models simply used different combinations of the following variables that, as we intuitively conjectured, may possess the desirable properties stated above:

m1, m2, ..., m16: the number of matches for each of the patterns,
M = m1 + m2 + ... + m16: the total number of pattern matches,
DFA: the number of pages on the web containing the candidate answer (e.g. Pepsi),
DFC: the number of pages on the web containing the expected category (e.g. soft drink),
DFAC: the number of web pages containing both the candidate answer (e.g. Pepsi) and the expected category (e.g. soft drink).

The intuition behind this selection of variables was that more frequent candidate answers would result in larger numbers of pattern matches simply due to chance (random co-occurrence); thus we would expect s to be a decreasing function of DFA. Our model also takes into consideration that some pairs of words frequently co-occur even if one of them is not a category of the other: e.g. "cup of coffee" would match one of our patterns even though no underlying categorical relationship exists, while "city of Rome" would be a legitimate indication that Rome is a city. For this reason, we introduced DFAC into the model. Finally, the model may benefit from DFC as a normalizing variable for the m's and DFAC.

Based on the observation that the distributions of the frequencies of occurrences (DFA, DFC) and co-occurrences (DFAC) typically look more uniform in the log space, we also explored models that use the natural logs of the variables M, DFA, DFC, and DFAC. Therefore, the complete set of variables studied in our models was the following: (m1, m2, ..., m16, M, DFA, DFC, DFAC, log(M+1), log(DFA), log(DFC), log(DFAC)). Since M was equal to 0 for many data points, we added 1 to M before taking its logarithm so that it would always be defined. The candidates with DFAC = 0 were assigned s = 0.

Although this is not an exhaustive combination of the variables above, in this study we considered the following models:

Model 0: s = const, or simply no semantic verification. This was the baseline against which every other model is compared.
Model 1: s = s(m1, m2, ..., m16, DFA, DFC, DFAC), all original variables in their non-transformed form.
Model 2: s = s(log(M+1), log(DFA), log(DFC), log(DFAC)), all variables log-transformed; the aggregate variable M is used instead of the separate m's. This model subsumes the metrics based on pointwise mutual information studied earlier for other tasks [13,15].
Model 3: s = s(m1, m2, ..., m16, log(DFA), log(DFC), log(DFAC)), only the document frequencies are log-transformed (since they are typically much larger than the m's); the m's are not log-transformed and are used independently.
Model 4: s = s(DFA, DFC, DFAC, log(M+1)), only the total number of pattern matches is log-transformed; no separate m's are used.
Model 5: s = s(DFA, DFC, log(M+1)), the same as Model 4 but without DFAC.
Model 6: s = s(DFAC, log(M+1)), only DFAC (co-occurrence) and the log of M are used – the simplest model.

All of the models studied involve the numbers of matches (the m's or M), because early in the study we found that without them the performance of the model was not statistically different from random guessing. It is still possible that, without using patterns and based on co-occurrence statistics only, it would be possible to build predictive models that improve question answering accuracy, since the correct answer frequently co-occurs with the words from the question. Because we wanted to isolate the improvement due to semantic verification, we discarded the models that did not improve the classification accuracy directly.

The numbers of pattern matches were obtained for the top 30 candidates (according to the initial score sc) by sending a query to the underlying search engine (MSN.com in our case). Although the queries were sent in parallel, the verification would have slowed down the response of our system if it were used online. However, the real-time performance can later be addressed by having a previously indexed, sufficiently large corpus (e.g. the TREC collections) or by having direct access to the search engine index, which may be feasible when a semantic verification module is an integral part of a web search portal or connected to it through a local (or otherwise fast) network.

5. Empirical evaluation

5.1. Data sets

We approach semantic verification as a classification problem. We are specifically interested in the estimates of the probability that an answer belongs to a specific category. In order to build the predictive model, we created a labeled data set as follows. We took all the 68 questions from the TREC 2003 set that explicitly specified a semantic category. We did not use any questions that only implied a certain semantic category. For example, a "who is" question would typically imply a person. Similarly, our categories under investigation did not include locations (where) and dates (when). We ran the questions through our system without semantic verification and manually labeled the correctness of the semantic category as true/false for the top 30 candidate answers for each question.

The target semantic category was identified by applying 4 simple regular expressions to the questions. For example, the regular expression "(What|Which) (.+)(do|does|did|is|was|are|were)" would match the question "What tourist attractions are there in Reims?" and identify "tourist attraction" as the semantic category. This way, we obtained a data sample of 68 × 30 = 2040 candidates with already known category memberships (the dependent variable).

For testing, we sought a set that we had not used before. We went all the way back to TREC 2000 and used all the 41 questions that explicitly specified the semantic category of an expected answer – the same principle as we used for constructing our training set.

5.2. Category prediction

Since the data set included disproportionate numbers of negative cases (incorrect semantic category), in order to evaluate classification accuracy we over-sampled the positive cases (correct semantic category) to build a balanced data sample. Table 2 displays the classification accuracy of the models listed above.

Table 2. Classification accuracy of predicting membership in a given semantic category ("0" means wrong category, "1" means correct category).

Model   Observed category   Predicted 0   Predicted 1   Correct (%)
1       0                   4938          118           97.7
1       1                   2899          2028          41.2
1       Overall                                         69.8
2       0                   4144          912           82.0
2       1                   1924          3003          60.9
2       Overall                                         71.6
3       0                   4356          700           86.2
3       1                   2444          2483          50.4
3       Overall                                         68.5
4       0                   4568          488           90.3
4       1                   2262          2665          54.1
4       Overall                                         72.5
5       0                   4569          487           90.4
5       1                   2262          2665          54.1
5       Overall                                         72.5
6       0                   4506          550           89.1
6       1                   2340          2587          52.5
6       Overall                                         71.1

To test whether this classification improvement over what would be obtained purely by chance is statistically significant, we examined the underlying Receiver Operating Characteristic (ROC) curves. These curves plot the sensitivity of the classification (the rate of classifying positive cases correctly) against the false positive rate (1 − specificity). The diagonal (reference) line represents chance-level classification, where the two rates are equal; the further the curve lies above the reference line, the more accurate the test. Fig. 1 displays the ROC curves for the six classification models we studied. To determine whether the classification accuracy of a model is significantly better than chance (the diagonal line), we test whether the area under the ROC curve is significantly higher than 0.5. All the models tested here perform significantly better than chance (p < 0.01), leading to the conclusion that the variables used are useful predictors of the semantic category.
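A sketch of that check; y_true and y_score below are toy stand-ins for the manual labels and a model's membership estimates s:

import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores; an area under the ROC curve well above 0.5 means the
# model separates correct from incorrect categories better than chance.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])
print(roc_auc_score(y_true, y_score))   # -> 1.0 for this perfectly separable toy data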

In order to compare the models with each other, we needed a specific application, since some of the models may be better at recall while others at precision, which results from the very well known trade-off between Type I and Type II classification errors. For the same reason, the accuracy reported here cannot be compared to any of the prior research. For example, KnowItAll [13,15] evaluated accuracy based on specially harvested data sets, while our evaluation is based on the application of semantic verification to automated question answering.

5.3. Improving question answering accuracy

We evaluated the impact of semantic verification on our online fact seeking system using the sets of questions and correct-answer regular expressions from the TREC competitions. Although various metrics have been explored by researchers in the past, we used the mean (across all the questions) reciprocal rank of the first correct answer (MRR) as our primary metric, i.e. if the first answer is correct, the reciprocal rank is 1; if only the second is correct, it is 1/2, etc. The drawback of this metric is that it is not the most sensitive, because it only considers the first correct answer, ignoring what follows. For this reason, we also used the mean total reciprocal rank of the correct answers (TRDR). It aggregates the reciprocal ranks of all the correct answers; e.g., if the second and the third answers are the only correct ones, TRDR would be equal to 1/2 + 1/3.
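The two metrics, as described, can be computed as in the following sketch (ranks are 1-based; the sample data is invented):

# For each question, `ranks` lists the 1-based positions of its correct answers.
def mrr(per_question_ranks):
    return sum(1.0 / min(ranks) if ranks else 0.0
               for ranks in per_question_ranks) / len(per_question_ranks)

def trdr(per_question_ranks):
    return sum(sum(1.0 / r for r in ranks)
               for ranks in per_question_ranks) / len(per_question_ranks)

sample = [[2, 3], [1], []]           # three questions; the last has no correct answer
print(mrr(sample))                   # (1/2 + 1 + 0) / 3 = 0.5
print(round(trdr(sample), 3))        # ((1/2 + 1/3) + 1 + 0) / 3 ≈ 0.611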

We did not use the "degree of support" of the answer within the document as part of the metric, due to its known difficulty [27], and thus only checked whether the answer was correct, which is sometimes called "lenient" evaluation, to which the concerns of Lin et al. [27] do not apply.

Fig. 1. ROC curves (sensitivity vs. 1 − specificity) for the six models.

The TRDR measure is analogous to the Cumulative Gain introduced for evaluating document retrieval [22], where a logarithmic discount factor is used along with graded relevance judgments instead of our binary (correct/wrong) ones. In order to compare across tasks with a possibly varying number of relevant documents, a normalized Cumulative Gain measure has also been introduced [23]. However, applying these metrics here was problematic, because all the possible variations of the correct answers (all relevant "documents") would be extremely hard to enumerate for some questions. Besides, the question answering task is fundamentally different from the document retrieval task, since the former does not typically attempt to find all possible re-statements of the correct answers, while the latter attempts to identify all possibly relevant documents from a finite set.

In assessing the performance of the models, we first computed the baseline performance, which corresponded to Model 0: no semantic verification. The resulting MRR was 0.442. Next, we estimated the maximum possible range of improvement by computing the "would be" performance if our semantic verification were perfect. Thus, our "Oracle" returned 1 as the semantic verification score if the candidate answer belonged to the required category and 0 otherwise. This resulted in an MRR and TRDR of 0.580 and 0.956, respectively. The total proportion of questions that had a correct answer within the set of candidate answers was 0.65. The loss in accuracy from 0.65 to 0.58 is explained by the existence of highly scoring candidate answers that belonged to the required category but were nevertheless wrong (e.g. What is the color of the sky? – orange).

Table 3. The performance of the models on the training set. ** and * indicate statistical significance at the 0.05 and 0.1 levels, respectively.

Model      MRR     TRDR    MRR improvement (%)   TRDR improvement (%)
0 (base)   0.442   0.580   –                     –
1          0.455   0.618   +2.94*                +6.55*
2          0.447   0.620   +1.13                 +6.90
3          0.388   0.521   −12.22**              −10.17**
4          0.482   0.650   +9.05**               +12.07**
5          0.485   0.652   +9.73**               +12.41**
6          0.459   0.621   +3.85                 +7.07*
Oracle     0.580   0.956   +31                   +64

Table 4. The models' performance on the "unseen" testing set.

Model      MRR     TRDR    MRR improvement (%)   TRDR improvement (%)
0 (base)   0.380   0.542   –                     –
1          0.420   0.588   +11**                 +8.5**
2          0.399   0.577   +5*                   +6.5**
3          0.398   0.581   +5*                   +7.2**
4          0.422   0.583   +11**                 +7.6**
5          0.433   0.596   +14**                 +10.0**
6          0.401   0.559   +6*                   +3.1

To test how much our classification improves performance, we compared the MRR and TRDR values from the six models we have built to the base model (Model 0). The summary results are displayed in Table 3. As Table 3 indicates, all the models, except for Model 3, improve the performance over the baseline. Models 4 and 5 are especially promising. The results suggest that by using our semantic verification approach, question answering performance can be improved (in terms of both MRR and TRDR) significantly, especially when the (log-transformed) pattern matches are used (Models 4 and 5). The results also indicate that treating the numbers of pattern matches separately, instead of as one aggregate variable, does not result in a statistically detectable additional improvement. Obviously, using larger data sets may refine the results reported here.

5.4. Testing on a previously unseen set

Since the first two testing steps performed above constituted training (tuning) steps, in order to provide higher objectivity of the results and to avoid the possible dangers of overfitting, we also performed tests on a set of questions that the system had never seen before (the testing set).

We did not manually label the testing set into correct/wrong categories of the candidate answers, and only measured the effect of our semantic verification on the answer accuracy using the same metrics of MRR and TRDR as described above. Table 4 presents the results. The improvement ranged from +6% to +16% depending on the model. All of the model results are significantly different (p < 0.05) from those of the baseline (no verification). The results clearly indicate that the improvement from our semantic verification is noticeable. The correlation between the performance of the models on the training set (Table 2) and on the testing set is also noticeable. Future, more detailed tests may establish this with higher reliability.

5.5. Analysis of the challenges and suggested remedies

After running our quantitative tests, we performed a detailed qualitative error analysis. In this section we elaborate on the specific factors contributing to the errors and the general challenges they represent. We also suggest specific remedies to overcome those challenges and give our qualitative evaluation of those remedies. We believe that not following a formal methodology allowed us to look deeper into the nature of the challenges to automated factoid question answering, rather than focusing on comparing one approach to another based on a certain set of questions assumed to be a "gold standard" for such reasons as convenience or popularity. Thus, at the highest level we posed the following research questions: (1) What obstacles keep factoid question answering from being perfectly accurate? and (2) How realistic is it to overcome those obstacles with the current technology?

First, we manually inspected the detailed log files that our system created for each question attempted from all the TREC test sets up to the year 2006. We were deliberately looking for questions that proved to be hard for the algorithms involved, and that cannot be easily approached by parameter tuning or by adding more simple heuristics applicable to only one or a small group of questions. Again, since the types and topics of factoid questions are extremely diverse and due to the nature of the current study, we limited our attention to the questions in which the answer was expected to fit a specific semantic category, and not one of the most common categories (person, place, date or number). We elaborate below on our observations.

(1) More proactive search is needed: For some questions, the correct answer was not in the initial set of candidate answers. For example, for the question What record company is Fred Durst with?, we noticed that the correct answer ("Interscope Records") objectively has a very low chance of entering the initial set, since it does not frequently co-occur with the words from the question and does not match any of the answer patterns that we had (e.g. there is no simple answer sentence such as Fred Durst is with Interscope Records). There are phrases more frequently co-occurring with the words from the question that will always dominate the correct answer; however, none of those phrases represents a record company. Rather than expanding the initial candidate answer set, which may in theory become infinite, we suggest the following remedy. The search for the correct answer should proceed in two streams: in addition to the current approach to the search process, which starts with the snippets returned by the underlying search engine, another search process should proceed by checking whether the members of the expected semantic category (e.g. record companies) also match the answer patterns or frequently associate with the words from the question. We manually verified that many other members (e.g. Columbia Records, EMI, Warner Bros, including the correct answer Interscope Records) can be mined using the same patterns as listed in Table 1 but without a specific candidate instance, i.e. as the text immediately preceding the phrase "is a record company."

(2) Limitation of the pattern-based approach: False positives can always happen within semantic verification that is based solely on pattern matches (as opposed to parsing). The only common scenario that we noticed in which such false positives negatively impacted the performance occurred when a candidate answer frequently appeared within another stable phrase. For example, a pure pattern matching approach will always result in the belief that music is a record company, because of such sentences as Guild Music is a record company or Bodog Music is a record company. A relatively simple dependency parsing should be able to discard those false positives. For the purpose discussed here, the dependency parsing only needs to establish that music matches one of the verification patterns only when it is part of another stable phrase (e.g. Guild Music), and thus the verb clause is a record company applies to the entire phrase rather than just to music alone. Although this kind of parsing would seemingly be required for all the pattern matches and could potentially be time prohibitive with the current parsing and downloading speeds, the approach can be heuristically optimized to be applied only to the most "promising" candidate answers, those that cannot be discarded by the models introduced in this paper. We have verified informally that it is sufficient to apply a dependency parsing based on minimization of the pointwise mutual information (PMI), as suggested for linguistic oddity detection by Fong et al. [21] (a small sketch of this PMI check appears after this list). For example, the phrase Guild Music has a much higher PMI than the phrase Music is a record company, and thus the minimization algorithm identifies guild as a dependent of music and suggests ignoring the pattern match occurring in a sentence such as Guild Music is a record company when validating the candidate answer music. On the other hand, the sentence The famous Interscope is a record company would still produce a valid match when validating the candidate answer Interscope, because the PMI of the phrase famous Interscope is smaller than the PMI of the phrase Interscope is a record company.

(3) The considered set of patterns is simply not always enough: For example, the question "What rank did Eileen Marie Collins reach?" exemplifies two serious challenges: (1) the good answers are context specific: in this case, they have to be military (not, e.g., mathematical) ranks; (2) a number of valid answers (major, general, private, captain) are highly polysemic words: they have multiple common meanings. After experimenting with a number of possible heuristic solutions, we established that the following two modifications are sufficient to overcome those challenges (both are illustrated in the third sketch following this list):
(a) Context string: adding a string representing the context of the question to the queries submitted to the search engine. For example, the following "context string" can be automatically constructed from the list of words that most frequently co-occur with the words of the example question: air force pilot military NASA. Since the words in context strings constructed this way tend to be very frequent (general) words, they do not cause sharp drops in the number of pages returned by the underlying search engine. At the same time, it was obvious after manual inspection that the context of 90% of the returned pages was relevant to the one in question (e.g. military or NASA) and that the obtained statistics indeed addressed the problem.
(b) Category-specific patterns: learning category-specific patterns "on the fly", that is, from the most confident instances (seeds), e.g. from colonel, lieutenant, or sergeant in the example in the previous paragraph. While trying this approach, we were able to (semi-automatically) discover from the Internet the following instance-specific patterns: took rank of X, promoted to X, promoted to rank of X, after serving fifteen years as X, commissioned as X, appointed X, positions at the rank of X, X Rank Insignia, X (military rank), X Military Rank History, where X is a specific military rank. We have verified that applying this heuristic approach makes it possible to answer this and other similarly challenging questions (e.g. "What industry is Rohm and Haas in?").
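The mining step mentioned in item (1) can be illustrated with a minimal sketch. It is not the exact implementation used in our system: the function name mine_seed_instances and the assumption that web snippets have already been retrieved through some search API are introduced here purely for illustration.

```python
import re
from collections import Counter

def mine_seed_instances(category, snippets, top_n=10):
    """Collect the capitalized phrases that immediately precede the verification
    pattern 'is a <category>' and return the most frequent ones as seed instances."""
    # One to three consecutive capitalized words right before the pattern,
    # e.g. "Bodog Music is a record company" -> "Bodog Music".
    pattern = re.compile(
        r"((?:[A-Z][\w&'.-]*\s+){0,2}[A-Z][\w&'.-]*)\s+is\s+an?\s+" + re.escape(category))
    counts = Counter()
    for text in snippets:
        for match in pattern.finditer(text):
            counts[match.group(1)] += 1
    return counts.most_common(top_n)

# Example with the sentences quoted above standing in for downloaded snippets:
snippets = ["Guild Music is a record company.",
            "The famous Interscope is a record company."]
print(mine_seed_instances("record company", snippets))
# -> [('Guild Music', 1), ('Interscope', 1)]
```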
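The PMI comparison described in item (2) can be sketched as follows. Here hit_count is a hypothetical function that returns the number of web pages containing an exact phrase (e.g. obtained through a search engine API), N is a rough constant standing in for the size of the index, and the counts in the example are made up; the sketch therefore only illustrates the comparison, not our exact implementation.

```python
import math

N = 1e10  # assumed total number of indexed pages (a rough constant)

def pmi(left, right, hit_count):
    """Pointwise mutual information of two adjacent strings, estimated from phrase hit counts."""
    joint = hit_count(left + " " + right)
    if joint == 0:
        return float("-inf")
    return math.log(joint * N / (hit_count(left) * hit_count(right)))

def keep_pattern_match(prev_word, candidate, pattern_tail, hit_count):
    """Keep a match only if the candidate is bound more tightly to the verification
    pattern than to the word preceding it (e.g. discard 'Music' in 'Guild Music')."""
    return pmi(candidate, pattern_tail, hit_count) >= pmi(prev_word, candidate, hit_count)

# Made-up counts standing in for real search engine statistics:
fake_counts = {"Guild Music": 20000, "Guild": 900000, "Music": 50000000,
               "Music is a record company": 30, "is a record company": 400000,
               "famous Interscope": 40, "famous": 80000000, "Interscope": 600000,
               "Interscope is a record company": 900}

def hit_count(phrase):
    return fake_counts.get(phrase, 1)

print(keep_pattern_match("Guild", "Music", "is a record company", hit_count))       # False
print(keep_pattern_match("famous", "Interscope", "is a record company", hit_count)) # True
```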
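The two remedies listed in item (3) can be sketched as below. The helper names (build_context_string, learn_seed_patterns), the omission of stop-word filtering, and the simple "words immediately preceding a seed" heuristic are illustrative simplifications rather than the exact procedure we used.

```python
import re
from collections import Counter

def build_context_string(cooccurrence_counts, max_words=5):
    """Join the words that co-occur most often with the question terms,
    e.g. 'air force pilot military NASA' (stop-word filtering omitted here)."""
    return " ".join(word for word, _ in cooccurrence_counts.most_common(max_words))

def learn_seed_patterns(seeds, snippets, max_patterns=10):
    """Turn the word sequences that most often precede a seed answer into
    category-specific templates such as 'promoted to rank of X'."""
    counts = Counter()
    for text in snippets:
        for seed in seeds:
            # Up to three words immediately before the seed become a candidate pattern.
            for m in re.finditer(r"((?:\w+\s+){1,3})" + re.escape(seed), text, re.IGNORECASE):
                counts[m.group(1).strip().lower() + " X"] += 1
    return [p for p, _ in counts.most_common(max_patterns)]

# Illustrative snippets only; real ones would come from context-constrained web queries:
seeds = ["colonel", "lieutenant", "sergeant"]
snippets = ["She was later promoted to the rank of colonel.",
            "He was commissioned as lieutenant after the academy."]
print(learn_seed_patterns(seeds, snippets))
# -> ['the rank of X', 'was commissioned as X']
```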

Finally, we verified informally and semi-automatically that all the questions from the TREC test sets up to the year 2006 with a semantic category specified (not person, place, date or number) can be answered with 100% accuracy provided the remedies suggested above are properly implemented.

6. Conclusions

We have explored a pattern-based semantic verification algorithm and demonstrated that it improves the accuracy of answers produced by a redundancy-based question answering system. The overall performance of the question answering system that we used was evaluated earlier and found comparable to that of other current state-of-the-art systems that do not require an elaborate knowledge base. Although the systems that reported the best performance at TREC competitions surpass the "knowledge-light" systems, including ours, those top-performing systems do not disclose their algorithms completely and do not make the underlying knowledge available to other researchers. This makes replication and follow-up practically impossible. That is why we were not able to test our method within any of the best (top 3) systems, and this may be considered a limitation of this research. However, the algorithms behind knowledge-light systems like AskMSR [14] or ours have already been studied by a number of researchers, and their performance, even though numerically inferior to the best commercial systems, has been confirmed and analyzed.

To our knowledge, this is the first study on completely automated verification of membership in a previously unanticipated semantic category for the purpose of improving question answering. For this reason, we do not have any other technique to report here for comparison. Our method differs in coverage and in the manual effort necessary from those based on WordNet or other manually developed taxonomies. Since in our sets of questions WordNet contained entries for only 12% of the categories and 6% of the candidate answers involved, we do not believe that a method based solely on WordNet would provide comparable results. Indeed, many of the categories in our test set were phrases (American feminist, sporting event, Ridley Scott movie, etc.), for which there were no direct entries in WordNet. Besides, our approach can be complementary to those based on WordNet (or other similar resources) since it relies on an entirely different source of data: patterns of occurrence on the Web (or any other comparably large corpus) rather than a manually created ontology. For those reasons, we did not think that comparing our approach to those based on sources such as WordNet would be scientifically justified.

Following our formal testing, we performed a detailed informal error analysis. We investigated what challenges currently stand in the way of providing each question with an answer that is within the correct semantic category. For all the serious challenges identified, we suggested possible remedies and informally verified that those remedies indeed completely address the challenges.

We believe that our results are extremely encouraging and that the approaches studied here will lead to error-free answering of questions without involving any pre-anticipation of types and topics. Since it is commonly believed that categorization is behind many cognitive tasks, automated simulation of verification of a membership in a given unrestricted semantic category has the potential to be a milestone within the broader scope of human cognition and artificial intelligence research.

References

[1] E. Agichtein, S. Lawrence, L. Gravano, Learning search engine specific query transformations for question answering, in: 10th WWW Conference, Hong Kong, China, May 1–5, 2001.
[2] E. Agirre, O. Ansa, E. Hovy, D. Martinez, Enriching very large ontologies using the WWW, in: Proceedings of the ECAI Ontology Learning Workshop, 2000.
[3] K. Ahmad, M. Tariq, B. Vrusias, C. Handy, Corpus-based thesaurus construction for image retrieval in specialist domains, in: Proceedings of the 25th European Conference on Advances in Information Retrieval (ECIR), 2003, pp. 502–510.
[4] E. Alfonseca, S. Manandhar, Extending a lexical ontology by a combination of distributional semantics signatures, in: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2002), 2002, pp. 1–7.
[5] E. Brill, Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging, Computational Linguistics 21 (4) (1995) 543–566.
[7] S. Brin, Extracting patterns and relations from the World Wide Web, in: Proceedings of the WebDB Workshop at EDBT '98, 1998.
[8] E. Charniak, M. Berland, Finding parts in very large corpora, in: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), 1999, pp. 57–64.
[9] P. Cimiano, G. Ladwig, S. Staab, Gimme' the context: context-driven automatic semantic annotation with C-PANKOW, in: Proceedings of the 14th World Wide Web Conference, 2005.
[10] P. Cimiano, S. Handschuh, S. Staab, Towards the self-annotating web, in: Proceedings of the 13th World Wide Web Conference, 2004, pp. 462–471.
[11] F. Ciravegna, A. Dingli, D. Guthrie, Y. Wilks, Integrating information to bootstrap information extraction from web sites, in: Proceedings of the IJCAI Workshop on Information Integration on the Web, 2003, pp. 9–14.
[12] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J.A. Tomlin, J.Y. Zien, SemTag and Seeker: bootstrapping the semantic web via automated semantic annotation, in: Proceedings of the 12th International World Wide Web Conference, ACM Press, 2003, pp. 178–186.
[13] D. Downey, O. Etzioni, S. Soderland, A probabilistic model of redundancy in information extraction, in: IJCAI-05, 2005.
[14] S. Dumais, M. Banko, E. Brill, J. Lin, A. Ng, Web question answering: is more always better? in: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, August 11–15, 2002.
[15] O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. Weld, A. Yates, Unsupervised named-entity extraction from the Web: an experimental study, Artificial Intelligence, 2005.
[16] M. Fleischman, E. Hovy, Fine grained classification of named entities, in: Proceedings of the Conference on Computational Linguistics, Taipei, Taiwan, August 2002.
[17] S.W. Fong, D. Roussinov, D.B. Skillicorn, Detecting word substitutions in text, IEEE Transactions on Knowledge and Data Engineering 20 (8) (2008) 1067–1076.
[18] R. Girju, M. Moldovan, Text mining for causal relations, in: Proceedings of the FLAIRS Conference, 2002, pp. 360–364.
[19] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, The role of lexico-semantic feedback in open-domain textual question-answering, in: Proceedings of the Association for Computational Linguistics, July 2001, pp. 274–281.
[20] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: NIST Special Publication 500-249: The Ninth Text REtrieval Conference (TREC 9), Gaithersburg, MD, November 13–16, 2000, pp. 479–488.
[21] M.A. Hearst, Automatic acquisition of hyponyms from large text corpora, in: Proceedings of the 14th International Conference on Computational Linguistics, 1992.
[22] K. Jarvelin, J. Kekalainen, IR evaluation methods for retrieving highly relevant documents, in: N. Belkin, P. Ingwersen, M.-K. Leong (Eds.), Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR '00), Athens, Greece, July 24–28, ACM Press, New York, NY, 2000, pp. 41–48 (receiver of the ACM SIGIR '00 Best Paper Award).
[23] K. Jarvelin, J. Kekalainen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on (Office) Information Systems 20 (4) (2002) 422–446.
[25] B. Katz, J. Lin, D. Loreto, W. Hildebrandt, M. Bilotti, S. Felshin, A. Fernandes, G. Marton, F. Mora, Integrating web-based and corpus-based techniques for question answering, in: Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), Gaithersburg, MD, November 2003.
[26] C. Kwok, O. Etzioni, D.S. Weld, Scaling question answering to the web, in: Proceedings of the 10th International WWW Conference, Hong Kong, 2001.
[27] J. Lin, Evaluation of resources for question answering evaluation, in: Proceedings of the ACM Conference on Research and Development in Information Retrieval, 2005.
[28] K. Markert, N. Modjeska, M. Nissim, Using the web for nominal anaphora resolution, in: EACL Workshop on the Computational Treatment of Anaphora, 2003.

[29] D. Moldovan, M. Pasca, S. Harabagiu, M. Surdeanu, Performance issues and error analysis in an open domain question answering system, in: Proceedings of ACL-2002, 2002, pp. 33–40.
[30] M. Poesio, T. Ishikawa, S. Schulte im Walde, R. Viera, Acquiring lexical knowledge for anaphora resolution, in: Proceedings of the 3rd Conference on Language Resources and Evaluation (LREC), 2002.
[31] D.R. Radev, H. Qi, Z. Zheng, S. Blair-Goldensohn, Z. Zhang, W. Fan, J. Prager, Mining the web for answers to natural language questions, in: Proceedings of ACM CIKM 2001: Tenth International Conference on Information and Knowledge Management, Atlanta, GA, November 5–10, 2001.
[32] D. Radev, W. Fan, H. Qi, H. Wu, A. Grewal, Probabilistic question answering on the web, Journal of the American Society for Information Science and Technology 56 (3) (2005).
[33] D. Ravichandran, E. Hovy, Learning surface text patterns for a question answering system, in: Proceedings of ACL, 2002.
[35] D. Roussinov, M. Chau, E. Filatova, J. Robles, Building on redundancy: factoid question answering, robust retrieval and the "other", in: Proceedings of the Text REtrieval Conference (TREC), National Institute of Standards and Technology (NIST), November 15–18, 2005.
[36] D. Roussinov, W. Fan, J. Robles, Beyond keywords: automated question answering on the web, Communications of the ACM (CACM), September 2008.
[37] S. Schlobach, M. Olsthoorn, M. de Rijke, Type checking in open-domain question answering (extended abstract), in: R. Verbrugge, N. Taatgen, L. Schomaker (Eds.), Proceedings of BNAIC 2004, 2004, pp. 367–368.
[38] M. Soubbotin, S. Soubbotin, Use of patterns for detection of likely answer strings: a systematic approach, in: Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), Gaithersburg, MD, November 19–22, 2002.
[39] E. Voorhees, L.P. Buckland (Eds.), Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), Gaithersburg, MD, November 2004.
[40] E. Voorhees, L.P. Buckland (Eds.), Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005), Gaithersburg, MD, November 2005.
[42] <www.qa.wpcarey.asu.edu>.