
Building Watson: An Overview of the DeepQA Project

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty

Copyright 2010, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602

IBM Research undertook a challenge to build a computer system that could compete at the human champion level in real time on the American TV quiz show, Jeopardy. The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise. The Jeopardy Challenge helped us address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After three years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy quiz show. Our results strongly suggest that DeepQA is an effective and extensible architecture that can be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of question answering (QA).

The goals of IBM Research are to advance computer science by exploring new ways for computer technology to affect science, business, and society. Roughly three years ago, IBM Research was looking for a major research challenge to rival the scientific and popular interest of Deep Blue, the computer chess-playing champion (Hsu 2002), that also would have clear relevance to IBM business interests.

With a wealth of enterprise-critical information being captured in natural language documentation of all forms, the problems with perusing only the top 10 or 20 most popular documents containing the user's two or three key words are becoming increasingly apparent. This is especially the case in the enterprise, where popularity is not as important an indicator of relevance and where recall can be as critical as precision. There is growing interest in having enterprise computer systems deeply analyze the breadth of relevant content to more precisely answer and justify answers to users' natural language questions. We believe advances in question-answering (QA) technology can help support professionals in critical and timely decision making in areas like compliance, health care, business integrity, business intelligence, knowledge discovery, enterprise knowledge management, security, and customer support.


For researchers, the open-domain QA problem is attractive as it is one of the most challenging in the realm of computer science and artificial intelligence, requiring a synthesis of information retrieval, natural language processing, knowledge representation and reasoning, machine learning, and computer-human interfaces. It has had a long history (Simmons 1970) and saw rapid advancement spurred by system building, experimentation, and government funding in the past decade (Maybury 2004; Strzalkowski and Harabagiu 2006).

With QA in mind, we settled on a challenge to build a computer system, called Watson,1 which could compete at the human champion level in real time on the American TV quiz show, Jeopardy. The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise.

Jeopardy! is a well-known TV quiz show that has been airing on television in the United States for more than 25 years (see the Jeopardy! Quiz Show sidebar for more information on the show). It pits three human contestants against one another in a competition that requires answering rich natural language questions over a very broad domain of topics, with penalties for wrong answers. The nature of the three-person competition is such that confidence, precision, and answering speed are of critical importance, with roughly 3 seconds to answer each question. A computer system that could compete at human champion levels at this game would need to produce exact answers to often complex natural language questions with high precision and speed and have a reliable confidence in its answers, such that it could answer roughly 70 percent of the questions asked with greater than 80 percent precision in 3 seconds or less.

Finally, the Jeopardy Challenge represents a unique and compelling AI question similar to the one underlying Deep Blue (Hsu 2002): can a computer system be designed to compete against the best humans at a task thought to require high levels of human intelligence, and if so, what kind of technology, algorithms, and engineering is required? While we believe the Jeopardy Challenge is an extraordinarily demanding task that will greatly advance the field, we appreciate that this challenge alone does not address all aspects of QA and does not by any means close the book on the QA challenge the way that Deep Blue may have for playing chess.

The Jeopardy Challenge

Meeting the Jeopardy Challenge requires advancing and incorporating a variety of QA technologies including parsing, question classification, question decomposition, automatic source acquisition and evaluation, entity and relation detection, logical form generation, and knowledge representation and reasoning.

Winning at Jeopardy requires accurately computing confidence in your answers. The questions and content are ambiguous and noisy and none of the individual algorithms are perfect. Therefore, each component must produce a confidence in its output, and individual component confidences must be combined to compute the overall confidence of the final answer. The final confidence is used to determine whether the computer system should risk choosing to answer at all. In Jeopardy parlance, this confidence is used to determine whether the computer will "ring in" or "buzz in" for a question. The confidence must be computed during the time the question is read and before the opportunity to buzz in. This is roughly between 1 and 6 seconds, with an average around 3 seconds.

Confidence estimation was very critical to shaping our overall approach in DeepQA. There is no expectation that any component in the system does a perfect job; all components post features of the computation and associated confidences, and we use a hierarchical machine-learning method to combine all these features and decide whether or not there is enough confidence in the final answer to attempt to buzz in and risk getting the question wrong.

In this section we elaborate on the various aspects of the Jeopardy Challenge.

    The Categories

A 30-clue Jeopardy board is organized into six columns. Each column contains five clues and is associated with a category. Categories range from broad subject headings like history, science, or politics to less informative puns like "Tutu Much," in which the clues are about ballet, to actual parts of the clue, like "Who appointed me to the Supreme Court?" where the clue is the name of a judge, to "anything goes" categories like "Potpourri." Clearly some categories are essential to understanding the clue, some are helpful but not necessary, and some may be useless, if not misleading, for a computer.

A recurring theme in our approach is the requirement to try many alternate hypotheses in varying contexts to see which produces the most confident answers given a broad range of loosely coupled scoring algorithms. Leveraging category information is another clear area requiring this approach.

    The Questions

There are a wide variety of ways one can attempt to characterize the Jeopardy clues: for example, by topic, by difficulty, by grammatical construction, by answer type, and so on. A type of classification that turned out to be useful for us was based on the primary method deployed to solve the clue.


The Jeopardy! Quiz Show

The Jeopardy! quiz show is a well-known syndicated U.S. TV quiz show that has been on the air since 1984. It features rich natural language questions covering a broad range of general knowledge. It is widely recognized as an entertaining game requiring smart, knowledgeable, and quick players.

The show's format pits three human contestants against each other in a three-round contest of knowledge, confidence, and speed. All contestants must pass a 50-question qualifying test to be eligible to play. The first two rounds of a game use a grid organized into six columns, each with a category label, and five rows with increasing dollar values. The illustration shows a sample board for a first round. In the second round, the dollar values are doubled. Initially all the clues in the grid are hidden behind their dollar values. The game play begins with the returning champion selecting a cell on the grid by naming the category and the dollar value. For example the player may select by saying "Technology for $400."

The clue under the selected cell is revealed to all the players and the host reads it out loud. Each player is equipped with a hand-held signaling button. As soon as the host finishes reading the clue, a light becomes visible around the board, indicating to the players that their hand-held devices are enabled and they are free to signal, or "buzz in," for a chance to respond. If a player signals before the light comes on, then he or she is locked out for one-half of a second before being able to buzz in again.

The first player to successfully buzz in gets a chance to respond to the clue. That is, the player must answer the question, but the response must be in the form of a question. For example, validly formed responses are "Who is Ulysses S. Grant?" or "What is The Tempest?" rather than simply "Ulysses S. Grant" or "The Tempest." The Jeopardy quiz show was conceived to have the host providing the answer, or clue, and the players responding with the corresponding question, or response. The clue/response concept represents an entertaining twist on classic question answering. Jeopardy clues are straightforward assertional forms of questions. So where a question might read, "What drug has been shown to relieve the symptoms of ADD with relatively few side effects?" the corresponding Jeopardy clue might read "This drug has been shown to relieve the symptoms of ADD with relatively few side effects." The correct Jeopardy response would be "What is Ritalin?"

Players have 5 seconds to speak their response, but it's typical that they answer almost immediately since they often only buzz in if they already know the answer. If a player responds to a clue correctly, then the dollar value of the clue is added to the player's total earnings, and that player selects another cell on the board. If the player responds incorrectly then the dollar value is deducted from the total earnings, and the system is rearmed, allowing the other players to buzz in. This makes it important for players to know what they know, to have accurate confidences in their responses.

There is always one cell in the first round and two in the second round called Daily Doubles, whose exact location is hidden until the cell is selected by a player. For these cases, the selecting player does not have to compete for the buzzer but must respond to the clue regardless of the player's confidence. In addition, before the clue is revealed the player must wager a portion of his or her earnings. The minimum bet is $5 and the maximum bet is the larger of the player's current score and the maximum clue value on the board. If players answer correctly, they earn the amount they bet, else they lose it.

The Final Jeopardy round consists of a single question and is played differently. First, a category is revealed. The players privately write down their bet, an amount less than or equal to their total earnings. Then the clue is revealed. They have 30 seconds to respond. At the end of the 30 seconds they reveal their answers and then their bets. The player with the most money at the end of this third round wins the game. The questions used in this round are typically more difficult than those used in the previous rounds.



The bulk of Jeopardy clues represent what we would consider factoid questions: questions whose answers are based on factual information about one or more individual entities. The questions themselves present challenges in determining what exactly is being asked for and which elements of the clue are relevant in determining the answer. Here are just a few examples (note that while the Jeopardy! game requires that answers are delivered in the form of a question (see the Jeopardy! Quiz Show sidebar), this transformation is trivial and for purposes of this paper we will just show the answers themselves):

Category: General Science
Clue: When hit by electrons, a phosphor gives off electromagnetic energy in this form.
Answer: Light (or Photons)

Category: Lincoln Blogs
Clue: Secretary Chase just submitted this to me for the third time; guess what, pal. This time I'm accepting it.
Answer: his resignation

Category: Head North
Clue: They're the two states you could be reentering if you're crossing Florida's northern border.
Answer: Georgia and Alabama

Decomposition. Some more complex clues contain multiple facts about the answer, all of which are required to arrive at the correct response but are unlikely to occur together in one place. For example:

Category: Rap Sheet
Clue: This archaic term for a mischievous or annoying child can also mean a rogue or scamp.
Subclue 1: This archaic term for a mischievous or annoying child.
Subclue 2: This term can also mean a rogue or scamp.
Answer: Rapscallion

In this case, we would not expect to find both subclues in one sentence in our sources; rather, if we decompose the question into these two parts and ask for answers to each one, we may find that the answer common to both questions is the answer to the original clue.
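As an illustration of this intersection idea (a minimal sketch, not the actual DeepQA combination component), the following Python fragment assumes each subclue has already been answered independently and keeps only the candidates proposed for both; all candidate names and confidence scores are invented:

    # Toy candidate pools; in the real system each would come from running
    # the full QA pipeline on one subclue.
    subclue1_candidates = {"urchin": 0.40, "imp": 0.35, "rapscallion": 0.30}
    subclue2_candidates = {"rogue": 0.45, "rapscallion": 0.33, "knave": 0.25}

    def combine(cands1, cands2):
        """Keep answers common to both subclues, scored here by the product
        of their per-subclue confidences (one simple combination rule)."""
        common = set(cands1) & set(cands2)
        return sorted(((a, cands1[a] * cands2[a]) for a in common),
                      key=lambda pair: -pair[1])

    print(combine(subclue1_candidates, subclue2_candidates))
    # [('rapscallion', 0.099)]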

Another class of decomposable questions is one in which a subclue is nested in the outer clue, and the subclue can be replaced with its answer to form a new question that can more easily be answered. For example:

Category: Diplomatic Relations
Clue: Of the four countries in the world that the United States does not have diplomatic relations with, the one that's farthest north.
Inner subclue: The four countries in the world that the United States does not have diplomatic relations with (Bhutan, Cuba, Iran, North Korea).
Outer subclue: Of Bhutan, Cuba, Iran, and North Korea, the one that's farthest north.
Answer: North Korea

Decomposable Jeopardy clues generated requirements that drove the design of DeepQA to generate zero or more decomposition hypotheses for each question as possible interpretations.

Puzzles. Jeopardy also has categories of questions that require special processing defined by the category itself. Some of them recur often enough that contestants know what they mean without instruction; for others, part of the task is to figure out what the puzzle is as the clues and answers are revealed (categories requiring explanation by the host are not part of the challenge). Examples of well-known puzzle categories are the Before and After category, where two subclues have answers that overlap by (typically) one word, and the Rhyme Time category, where the two subclue answers must rhyme with one another. Clearly these cases also require question decomposition. For example:

Category: Before and After Goes to the Movies
Clue: Film of a typical day in the life of the Beatles, which includes running from bloodthirsty zombie fans in a Romero classic.
Subclue 1: Film of a typical day in the life of the Beatles.
Answer 1: (A Hard Day's Night)
Subclue 2: Running from bloodthirsty zombie fans in a Romero classic.
Answer 2: (Night of the Living Dead)
Answer: A Hard Day's Night of the Living Dead

Category: Rhyme Time
Clue: It's where Pele stores his ball.
Subclue 1: Pele ball (soccer)
Subclue 2: where store (cabinet, drawer, locker, and so on)
Answer: soccer locker

There are many infrequent types of puzzle categories including things like converting roman numerals, solving math word problems, sounds like, finding which word in a set has the highest Scrabble score, homonyms and heteronyms, and so on. Puzzles constitute only about 2-3 percent of all clues, but since they typically occur as entire categories (five at a time) they cannot be ignored for success in the Challenge, as getting them all wrong often means losing a game.

Excluded Question Types. The Jeopardy quiz show ordinarily admits two kinds of questions that IBM and Jeopardy Productions, Inc., agreed to exclude from the computer contest: audiovisual (A/V) questions and Special Instructions questions. A/V questions require listening to or watching some sort of audio, image, or video segment to determine a correct answer. For example:

Category: Picture This
(Contestants are shown a picture of a B-52 bomber)
Clue: Alphanumeric name of the fearsome machine seen here.
Answer: B-52


Special instruction questions are those that are not self-explanatory but rather require a verbal explanation describing how the question should be interpreted and solved. For example:

Category: Decode the Postal Codes
Verbal instruction from host: We're going to give you a word comprising two postal abbreviations; you have to identify the states.
Clue: Vain
Answer: Virginia and Indiana

Both present very interesting challenges from an AI perspective but were put out of scope for this contest and evaluation.

    The Domain

As a measure of the Jeopardy Challenge's breadth of domain, we analyzed a random sample of 20,000 questions, extracting the lexical answer type (LAT) when present. We define a LAT to be a word in the clue that indicates the type of the answer, independent of assigning semantics to that word. For example, in the following clue the LAT is the string "maneuver."

Category: Oooh....Chess
Clue: Invented in the 1500s to speed up the game, this maneuver involves two pieces of the same color.
Answer: Castling

About 12 percent of the clues do not indicate an explicit lexical answer type but may refer to the answer with pronouns like "it," "these," or "this" or not refer to it at all. In these cases the type of answer must be inferred by the context. Here's an example:

Category: Decorating
Clue: Though it sounds "harsh," it's just embroidery, often in a floral pattern, done with yarn on cotton cloth.
Answer: crewel

The distribution of LATs has a very long tail, as shown in figure 1. We found 2500 distinct and explicit LATs in the 20,000 question sample. The most frequent 200 explicit LATs cover less than 50 percent of the data. Figure 1 shows the relative frequency of the LATs. It labels all the clues with no explicit type with the label "NA." This aspect of the challenge implies that while task-specific type systems or manually curated data would have some impact if focused on the head of the LAT curve, it still leaves more than half the problems unaccounted for. Our clear technical bias, for both business and scientific motivations, is to create general-purpose, reusable natural language processing (NLP) and knowledge representation and reasoning (KRR) technology that can exploit as-is natural language resources and as-is structured knowledge rather than to curate task-specific knowledge resources.

    The Metrics

In addition to question-answering precision, the system's game-winning performance will depend on speed, confidence estimation, clue selection, and betting strategy.


Figure 1. Lexical Answer Type Frequency. (Two panels show the relative frequency, as a percentage of the 20,000 sampled clues, of the 40 most frequent LATs and of the 200 most frequent LATs; clues with no explicit LAT are labeled NA.)


Ultimately the outcome of the public contest will be decided based on whether or not Watson can win one or two games against top-ranked humans in real time. The highest amount of money earned by the end of a one- or two-game match determines the winner. A player's final earnings, however, often will not reflect how well the player did during the game at the QA task. This is because a player may decide to bet big on Daily Double or Final Jeopardy questions. There are three hidden Daily Double questions in a game that can affect only the player lucky enough to find them, and one Final Jeopardy question at the end that all players must gamble on. Daily Double and Final Jeopardy questions represent significant events where players may risk all their current earnings. While potentially compelling for a public contest, a small number of games does not represent statistically meaningful results for the system's raw QA performance.

While Watson is equipped with betting strategies necessary for playing full Jeopardy, from a core QA perspective we want to measure correctness, confidence, and speed, without considering clue selection, luck of the draw, and betting strategies. We measure correctness and confidence using precision and percent answered. Precision measures the percentage of questions the system gets right out of those it chooses to answer. Percent answered is the percentage of questions it chooses to answer (correctly or incorrectly). The system chooses which questions to answer based on an estimated confidence score: for a given threshold, the system will answer all questions with confidence scores above that threshold. The threshold controls the trade-off between precision and percent answered, assuming reasonable confidence estimation. For higher thresholds the system will be more conservative, answering fewer questions with higher precision. For lower thresholds, it will be more aggressive, answering more questions with lower precision. Accuracy refers to the precision if all questions are answered.
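To make this trade-off concrete, here is a minimal Python sketch (not code from Watson) that assumes each evaluated question is reduced to a (confidence, is_correct) pair and sweeps a confidence threshold:

    def precision_at_threshold(results, threshold):
        """results: list of (confidence, is_correct) pairs, one per question.
        The system answers every question whose confidence meets the threshold;
        returns (percent_answered, precision) for that policy."""
        answered = [correct for conf, correct in results if conf >= threshold]
        if not answered:
            return 0.0, None
        return len(answered) / len(results), sum(answered) / len(answered)

    # Sweeping the threshold traces out a curve like those in figure 2.
    results = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
    for t in (0.0, 0.5, 0.85):
        print(t, precision_at_threshold(results, t))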

Figure 2 shows a plot of precision versus percent attempted curves for two theoretical systems. It is obtained by evaluating the two systems over a range of confidence thresholds. Both systems have 40 percent accuracy, meaning they get 40 percent of all questions correct. They differ only in their confidence estimation. The upper line represents an ideal system with perfect confidence estimation. Such a system would identify exactly which questions it gets right and wrong and give higher confidence to those it got right. As can be seen in the graph, if such a system were to answer the 50 percent of questions for which it had highest confidence, it would get 80 percent of those correct. We refer to this level of performance as 80 percent precision at 50 percent answered.


Figure 2. Precision Versus Percentage Attempted. Perfect confidence estimation (upper line) and no confidence estimation (lower line). (x-axis: % Answered; y-axis: Precision.)


The lower line represents a system without meaningful confidence estimation. Since it cannot distinguish between which questions it is more or less likely to get correct, its precision is constant for all percent attempted. Developing more accurate confidence estimation means a system can deliver far higher precision even with the same overall accuracy.

The Competition: Human Champion Performance

A compelling and scientifically appealing aspect of the Jeopardy Challenge is the human reference point. Figure 3 contains a graph that illustrates expert human performance on Jeopardy. It is based on our analysis of nearly 2000 historical Jeopardy games. Each point on the graph represents the performance of the winner in one Jeopardy game.2 As in figure 2, the x-axis of the graph, labeled % Answered, represents the percentage of questions the winner answered, and the y-axis of the graph, labeled Precision, represents the percentage of those questions the winner answered correctly.

In contrast to the system evaluation shown in figure 2, which can display a curve over a range of confidence thresholds, the human performance shows only a single point per game based on the observed precision and percent answered the winner demonstrated in the game. A further distinction is that in these historical games the human contestants did not have the liberty to answer all questions they wished. Rather, the percent answered consists of those questions for which the winner was confident and fast enough to beat the competition to the buzz. The system performance graphs shown in this paper are focused on evaluating QA performance, and so do not take into account competition for the buzz. Human performance helps to position our system's performance, but obviously, in a Jeopardy game, performance will be affected by competition for the buzz, and this will depend in large part on how quickly a player can compute an accurate confidence and how the player manages risk.


Figure 3. Champion Human Performance at Jeopardy. (x-axis: % Answered; y-axis: Precision.)


The center of what we call the Winners Cloud (the set of light gray dots in the graph in figures 3 and 4) reveals that Jeopardy champions are confident and fast enough to acquire on average between 40 percent and 50 percent of all the questions from their competitors and to perform with between 85 percent and 95 percent precision.

The darker dots on the graph represent Ken Jennings's games. Ken Jennings had an unequaled winning streak in 2004, in which he won 74 games in a row. Based on our analysis of those games, he acquired on average 62 percent of the questions and answered with 92 percent precision. Human performance at this task sets a very high bar for precision, confidence, speed, and breadth.

Baseline Performance

Our metrics and baselines are intended to give us confidence that new methods and algorithms are improving the system, or to inform us when they are not so that we can adjust research priorities.

Our most obvious baseline is the QA system called Practical Intelligent Question Answering Technology (PIQUANT) (Prager, Chu-Carroll, and Czuba 2004), which had been under development at IBM Research by a four-person team for 6 years prior to taking on the Jeopardy Challenge. At the time it was among the top three to five Text Retrieval Conference (TREC) QA systems. Developed in part under the U.S. government AQUAINT program3 and in collaboration with external teams and universities, PIQUANT was a classic QA pipeline with state-of-the-art techniques aimed largely at the TREC QA evaluation (Voorhees and Dang 2005). PIQUANT performed in the 33 percent accuracy range in TREC evaluations. While the TREC QA evaluation allowed the use of the web, PIQUANT focused on question answering using local resources. A requirement of the Jeopardy Challenge is that the system be self-contained and not link to live web search.

The requirements of the TREC QA evaluation were different than for the Jeopardy Challenge. Most notably, TREC participants were given a relatively small corpus (1M documents) from which answers to questions must be justified; TREC questions were in a much simpler form compared to Jeopardy questions; and the confidences associated with answers were not a primary metric. Furthermore, the systems were allowed to access the web and had a week to produce results for 500 questions. The reader can find details in the TREC proceedings4 and numerous follow-on publications.

An initial 4-week effort was made to adapt PIQUANT to the Jeopardy Challenge. The experiment focused on precision and confidence. It ignored issues of answering speed and aspects of the game like betting and clue values.

The questions used were 500 randomly sampled Jeopardy clues from episodes in the past 15 years. The corpus that was used contained, but did not necessarily justify, answers to more than 90 percent of the questions. The result of the PIQUANT baseline experiment is illustrated in figure 4. As shown, on the 5 percent of the clues that PIQUANT was most confident in (left end of the curve), it delivered 47 percent precision, and over all the clues in the set (right end of the curve), its precision was 13 percent. Clearly the precision and confidence estimation are far below the requirements of the Jeopardy Challenge.

A similar baseline experiment was performed in collaboration with Carnegie Mellon University (CMU) using OpenEphyra,5 an open-source QA framework developed primarily at CMU. The framework is based on the Ephyra system, which was designed for answering TREC questions. In our experiments on TREC 2002 data, OpenEphyra answered 45 percent of the questions correctly using a live web search.

We spent minimal effort adapting OpenEphyra, but like PIQUANT, its performance on Jeopardy clues was below 15 percent accuracy. OpenEphyra did not produce reliable confidence estimates and thus could not effectively choose to answer questions with higher confidence. Clearly a larger investment in tuning and adapting these baseline systems to Jeopardy would improve their performance; however, we limited this investment since we did not want the baseline systems to become significant efforts.

The PIQUANT and OpenEphyra baselines demonstrate the performance of state-of-the-art QA systems on the Jeopardy task. In figure 5 we show two other baselines that demonstrate the performance of two complementary approaches on this task. The light gray line shows the performance of a system based purely on text search, using terms in the question as queries and search engine scores as confidences for candidate answers generated from retrieved document titles. The black line shows the performance of a system based on structured data, which attempts to look the answer up in a database by simply finding the named entities in the database related to the named entities in the clue. These two approaches were adapted to the Jeopardy task, including identifying and integrating relevant content.

The results form an interesting comparison. The search-based system has better performance at 100 percent answered, suggesting that the natural language content and the shallow text search techniques delivered better coverage. However, the flatness of the curve indicates the lack of accurate confidence estimation.6 The structured approach had better informed confidence when it was able to decipher the entities in the question and found


the right matches in its structured knowledge bases, but its coverage quickly drops off when asked to answer more questions. To be a high-performing question-answering system, DeepQA must demonstrate both these properties to achieve high precision, high recall, and an accurate confidence estimation.

    The DeepQA Approach

Early on in the project, attempts to adapt PIQUANT (Chu-Carroll et al. 2003) failed to produce promising results. We devoted many months of effort to encoding algorithms from the literature. Our investigations ran the gamut from deep logical form analysis to shallow machine-translation-based approaches. We integrated them into the standard QA pipeline that went from question analysis and answer type determination to search and then answer selection. It was difficult, however, to find examples of how published research results could be taken out of their original context and effectively replicated and integrated into different end-to-end systems to produce comparable results. Our efforts failed to have significant impact on Jeopardy or even on prior baseline studies using TREC data.

We ended up overhauling nearly everything we did, including our basic technical approach, the underlying architecture, metrics, evaluation protocols, engineering practices, and even how we worked together as a team. We also, in cooperation with CMU, began the Open Advancement of Question Answering (OAQA) initiative. OAQA is intended to directly engage researchers in the community to help replicate and reuse research results and to identify how to more rapidly advance the state of the art in QA (Ferrucci et al. 2009).

As our results dramatically improved, we observed that system-level advances allowing rapid integration and evaluation of new ideas and new components against end-to-end metrics were essential to our progress. This was echoed at the OAQA workshop for experts with decades of investment in QA, hosted by IBM in early 2008. Among the workshop conclusions was that QA would benefit from the collaborative evolution of a single extensible architecture that would allow component results to be consistently evaluated in a common technical context against a growing variety of what were called Challenge Problems.


Figure 4. Baseline Performance. (x-axis: % Answered; y-axis: Precision.)


Different challenge problems were identified to address various dimensions of the general QA problem. Jeopardy was described as one addressing dimensions including high precision, accurate confidence determination, complex language, breadth of domain, and speed.

The system we have built and are continuing to develop, called DeepQA, is a massively parallel probabilistic evidence-based architecture. For the Jeopardy Challenge, we use more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses. What is far more important than any particular technique we use is how we combine them in DeepQA such that overlapping approaches can bring their strengths to bear and contribute to improvements in accuracy, confidence, or speed.

DeepQA is an architecture with an accompanying methodology, but it is not specific to the Jeopardy Challenge. We have successfully applied DeepQA to both the Jeopardy and TREC QA tasks. We have begun adapting it to different business applications and additional exploratory challenge problems including medicine, enterprise search, and gaming.

The overarching principles in DeepQA are massive parallelism, many experts, pervasive confidence estimation, and integration of shallow and deep knowledge.

Massive parallelism: Exploit massive parallelism in the consideration of multiple interpretations and hypotheses.

Many experts: Facilitate the integration, application, and contextual evaluation of a wide range of loosely coupled probabilistic question and content analytics.

Pervasive confidence estimation: No component commits to an answer; all components produce features and associated confidences, scoring different question and content interpretations. An underlying confidence-processing substrate learns how to stack and combine the scores.

Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.

Figure 6 illustrates the DeepQA architecture at a very high level. The remaining parts of this section provide a bit more detail about the various architectural roles.

Figure 5. Text Search Versus Knowledge Base Search. (x-axis: % Answered; y-axis: Precision.)



    Content Acquisition

The first step in any application of DeepQA to solve a QA problem is content acquisition, or identifying and gathering the content to use for the answer and evidence sources shown in figure 6.

Content acquisition is a combination of manual and automatic steps. The first step is to analyze example questions from the problem space to produce a description of the kinds of questions that must be answered and a characterization of the application domain. Analyzing example questions is primarily a manual task, while domain analysis may be informed by automatic or statistical analyses, such as the LAT analysis shown in figure 1. Given the kinds of questions and broad domain of the Jeopardy Challenge, the sources for Watson include a wide range of encyclopedias, dictionaries, thesauri, newswire articles, literary works, and so on.

Given a reasonable baseline corpus, DeepQA then applies an automatic corpus expansion process. The process involves four high-level steps: (1) identify seed documents and retrieve related documents from the web; (2) extract self-contained text nuggets from the related web documents; (3) score the nuggets based on whether they are informative with respect to the original seed document; and (4) merge the most informative nuggets into the expanded corpus. The live system itself uses this expanded corpus and does not have access to the web during play.
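A rough sketch of that four-step expansion loop appears below; the retrieval, nugget-extraction, and scoring functions are hypothetical placeholders supplied by the caller, since the paper does not specify how they are implemented:

    def expand_corpus(seed_documents, retrieve_related, extract_nuggets,
                      score_nugget, top_k=50):
        """Sketch of the corpus expansion process described above."""
        expanded = []
        for seed in seed_documents:
            related = retrieve_related(seed)                        # step 1
            nuggets = [n for doc in related
                       for n in extract_nuggets(doc)]               # step 2
            ranked = sorted(nuggets,                                 # step 3
                            key=lambda n: score_nugget(n, seed),
                            reverse=True)
            expanded.extend(ranked[:top_k])                          # step 4
        return expanded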

In addition to the content for the answer and evidence sources, DeepQA leverages other kinds of semistructured and structured content. Another step in the content-acquisition process is to identify and collect these resources, which include databases, taxonomies, and ontologies, such as dbPedia,7 WordNet (Miller 1995), and the Yago8 ontology.

    Question Analysis

The first step in the run-time question-answering process is question analysis. During question analysis the system attempts to understand what the question is asking and performs the initial analyses that determine how the question will be processed by the rest of the system. The DeepQA approach encourages a mixture of experts at this stage, and in the Watson system we produce shallow parses, deep parses (McCord 1990), logical forms, semantic role labels, coreference, relations, named entities, and so on, as well as specific kinds of analysis for question answering. Most of these technologies are well understood and are not discussed here, but a few require some elaboration.


    Figure 6. DeepQA High-Level Architecture.


Question Classification. Question classification is the task of identifying question types or parts of questions that require special processing. This can include anything from single words with potentially double meanings to entire clauses that have certain syntactic, semantic, or rhetorical functionality that may inform downstream components with their analysis. Question classification may identify a question as a puzzle question, a math question, a definition question, and so on. It will identify puns, constraints, definition components, or entire subclues within questions.

Focus and LAT Detection. As discussed earlier, a lexical answer type is a word or noun phrase in the question that specifies the type of the answer without any attempt to understand its semantics. Determining whether or not a candidate answer can be considered an instance of the LAT is an important kind of scoring and a common source of critical errors. An advantage of the DeepQA approach is to exploit many independently developed answer-typing algorithms. However, many of these algorithms are dependent on their own type systems. We found the best way to integrate preexisting components is not to force them into a single, common type system, but to have them map from the LAT to their own internal types.

The focus of the question is the part of the question that, if replaced by the answer, makes the question a stand-alone statement. Looking back at some of the examples shown previously, the focus of "When hit by electrons, a phosphor gives off electromagnetic energy in this form" is "this form"; the focus of "Secretary Chase just submitted this to me for the third time; guess what, pal. This time I'm accepting it" is the first "this"; and the focus of "This title character was the crusty and tough city editor of the Los Angeles Tribune" is "This title character." The focus often (but not always) contains useful information about the answer, is often the subject or object of a relation in the clue, and can turn a question into a factual statement when replaced with a candidate, which is a useful way to gather evidence about a candidate.

Relation Detection. Most questions contain relations, whether they are syntactic subject-verb-object predicates or semantic relationships between entities. For example, in the question, "They're the two states you could be reentering if you're crossing Florida's northern border," we can detect the relation borders(Florida,?x,north).

Watson uses relation detection throughout the QA process, from focus and LAT determination, to passage and answer scoring. Watson can also use detected relations to query a triple store and directly generate candidate answers. Due to the breadth of relations in the Jeopardy domain and the variety of ways in which they are expressed, however, Watson's current ability to effectively use curated databases to simply "look up" the answers is limited to fewer than 2 percent of the clues.

Watson's use of existing databases depends on the ability to analyze the question and detect the relations covered by the databases. In Jeopardy the broad domain makes it difficult to identify the most lucrative relations to detect. In 20,000 Jeopardy questions, for example, we found the distribution of Freebase9 relations to be extremely flat (figure 7). Roughly speaking, even achieving high recall on detecting the most frequent relations in the domain can at best help in about 25 percent of the questions, and the benefit of relation detection drops off fast with the less frequent relations. Broad-domain relation detection remains a major open area of research.

Decomposition. As discussed above, an important requirement driven by analysis of Jeopardy clues was the ability to handle questions that are better answered through decomposition. DeepQA uses rule-based deep parsing and statistical classification methods both to recognize whether questions should be decomposed and to determine how best to break them up into subquestions. The operating hypothesis is that the correct question interpretation and derived answer(s) will score higher after all the collected evidence and all the relevant algorithms have been considered. Even if the question did not need to be decomposed to determine an answer, this method can help improve the system's overall answer confidence.

DeepQA solves parallel decomposable questions through application of the end-to-end QA system on each subclue and synthesizes the final answers by a customizable answer combination component. These processing paths are shown in medium gray in figure 6. DeepQA also supports nested decomposable questions through recursive application of the end-to-end QA system to the inner subclue and then to the outer subclue. The customizable synthesis components allow specialized synthesis algorithms to be easily plugged into a common framework.

    Hypothesis Generation

Hypothesis generation takes the results of question analysis and produces candidate answers by searching the system's sources and extracting answer-sized snippets from the search results. Each candidate answer plugged back into the question is considered a hypothesis, which the system has to prove correct with some degree of confidence.

We refer to search performed in hypothesis generation as "primary search" to distinguish it from search performed during evidence gathering (described below). As with all aspects of DeepQA, we use a mixture of different approaches for primary search and candidate generation in the Watson system.


Primary Search. In primary search the goal is to find as much potentially answer-bearing content as possible based on the results of question analysis; the focus is squarely on recall, with the expectation that the host of deeper content analytics will extract answer candidates and score this content plus whatever evidence can be found in support or refutation of candidates to drive up the precision. Over the course of the project we continued to conduct empirical studies designed to balance speed, recall, and precision. These studies allowed us to regularly tune the system to find the number of search results and candidates that produced the best balance of accuracy and computational resources. The operative goal for primary search eventually stabilized at about 85 percent binary recall for the top 250 candidates; that is, the system generates the correct answer as a candidate answer for 85 percent of the questions somewhere within the top 250 ranked candidates.
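Binary recall of this kind is straightforward to compute; the sketch below assumes each evaluated question carries a ranked candidate list and a set of acceptable answer strings (the data format is invented for illustration):

    def binary_recall_at_k(questions, k=250):
        """Fraction of questions whose correct answer appears somewhere in
        the top-k ranked candidates."""
        hits = 0
        for q in questions:
            gold = {a.lower() for a in q["answers"]}
            if any(c.lower() in gold for c in q["candidates"][:k]):
                hits += 1
        return hits / len(questions)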

A variety of search techniques are used, including the use of multiple text search engines with different underlying approaches (for example, Indri and Lucene), document search as well as passage search, knowledge base search using SPARQL on triple stores, the generation of multiple search queries for a single question, and backfilling hit lists to satisfy key constraints identified in the question.

Triple store queries in primary search are based on named entities in the clue; for example, find all database entities related to the clue entities, or based on more focused queries in the cases that a semantic relation was detected. For a small number of LATs we identified as "closed LATs," the candidate answer can be generated from a fixed list in some store of known instances of the LAT, such as "U.S. President" or "Country."

Candidate Answer Generation. The search results feed into candidate generation, where techniques appropriate to the kind of search results are applied to generate candidate answers. For document search results from title-oriented resources, the title is extracted as a candidate answer. The system may generate a number of candidate answer variants from the same title based on substring analysis or link analysis (if the underlying source contains hyperlinks). Passage search results require more detailed analysis of the passage text to identify candidate answers. For example, named entity detection may be used to extract candidate answers from the passage. Some sources, such as a triple store and reverse dictionary lookup, produce candidate answers directly as their search result.


Figure 7. Approximate Distribution of the 50 Most Frequently Occurring Freebase Relations in 20,000 Randomly Selected Jeopardy Clues. (y-axis: percentage of clues, roughly 0 to 4 percent.)


If the correct answer(s) are not generated at this stage as a candidate, the system has no hope of answering the question. This step therefore significantly favors recall over precision, with the expectation that the rest of the processing pipeline will tease out the correct answer, even if the set of candidates is quite large. One of the goals of the system design, therefore, is to tolerate noise in the early stages of the pipeline and drive up precision downstream.

Watson generates several hundred candidate answers at this stage.

Soft Filtering

A key step in managing the resource versus precision trade-off is the application of lightweight (less resource intensive) scoring algorithms to a larger set of initial candidates to prune them down to a smaller set of candidates before the more intensive scoring components see them. For example, a lightweight scorer may compute the likelihood of a candidate answer being an instance of the LAT. We call this step soft filtering.

The system combines these lightweight analysis scores into a soft filtering score. Candidate answers that pass the soft filtering threshold proceed to hypothesis and evidence scoring, while those candidates that do not pass the filtering threshold are routed directly to the final merging stage. The soft filtering scoring model and filtering threshold are determined based on machine learning over training data.

Watson currently lets roughly 100 candidates pass the soft filter, but this is a parameterizable function.
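As a minimal sketch of such a step (the real soft-filtering model and threshold are learned from training data, and the feature names here are invented), a linear combination of lightweight scores could be thresholded like this:

    def soft_filter(candidates, weights, threshold):
        """candidates: dicts of lightweight feature scores, for example
        {"answer": "Nixon", "lat_likelihood": 0.8, "search_rank_score": 0.6}.
        Returns (passed, deferred): passed go on to deep evidence scoring,
        deferred are routed directly to final merging."""
        passed, deferred = [], []
        for cand in candidates:
            score = sum(w * cand.get(feat, 0.0) for feat, w in weights.items())
            (passed if score >= threshold else deferred).append(cand)
        return passed, deferred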

    Hypothesis and Evidence Scoring

Candidate answers that pass the soft filtering threshold undergo a rigorous evaluation process that involves gathering additional supporting evidence for each candidate answer, or hypothesis, and applying a wide variety of deep scoring analytics to evaluate the supporting evidence.

Evidence Retrieval. To better evaluate each candidate answer that passes the soft filter, the system gathers additional supporting evidence. The architecture supports the integration of a variety of evidence-gathering techniques. One particularly effective technique is passage search, where the candidate answer is added as a required term to the primary search query derived from the question. This will retrieve passages that contain the candidate answer used in the context of the original question terms. Supporting evidence may also come from other sources like triple stores. The retrieved supporting evidence is routed to the deep evidence scoring components, which evaluate the candidate answer in the context of the supporting evidence.

Scoring. The scoring step is where the bulk of the deep content analysis is performed. Scoring algorithms determine the degree of certainty that retrieved evidence supports the candidate answers. The DeepQA framework supports and encourages the inclusion of many different components, or scorers, that consider different dimensions of the evidence and produce a score that corresponds to how well evidence supports a candidate answer for a given question.

DeepQA provides a common format for the scorers to register hypotheses (for example candidate answers) and confidence scores, while imposing few restrictions on the semantics of the scores themselves; this enables DeepQA developers to rapidly deploy, mix, and tune components to support each other. For example, Watson employs more than 50 scoring components that produce scores ranging from formal probabilities to counts to categorical features, based on evidence from different types of sources including unstructured text, semistructured text, and triple stores. These scorers consider things like the degree of match between a passage's predicate-argument structure and the question, passage source reliability, geospatial location, temporal relationships, taxonomic classification, the lexical and semantic relations the candidate is known to participate in, the candidate's correlation with question terms, its popularity (or obscurity), its aliases, and so on.

Consider the question, "He was presidentially pardoned on September 8, 1974"; the correct answer, "Nixon," is one of the generated candidates. One of the retrieved passages is "Ford pardoned Nixon on Sept. 8, 1974." One passage scorer counts the number of IDF-weighted terms in common between the question and the passage. Another passage scorer, based on the Smith-Waterman sequence-matching algorithm (Smith and Waterman 1981), measures the lengths of the longest similar subsequences between the question and passage (for example "on Sept. 8, 1974"). A third type of passage scoring measures the alignment of the logical forms of the question and passage. A logical form is a graphical abstraction of text in which nodes are terms in the text and edges represent either grammatical relationships (for example, Hermjakob, Hovy, and Lin [2000]; Moldovan et al. [2003]), deep semantic relationships (for example, Lenat [1995]; Paritosh and Forbus [2005]), or both. The logical form alignment identifies Nixon as the object of the pardoning in the passage, and that the question is asking for the object of a pardoning. Logical form alignment gives Nixon a good score given this evidence. In contrast, a candidate answer like Ford would receive near identical scores to Nixon for term matching and passage alignment with this passage, but would receive a lower logical form alignment score.
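The first of those scorers, IDF-weighted term overlap, is easy to sketch; the document-frequency table and corpus size below are illustrative stand-ins for real corpus statistics, not values from Watson:

    import math, re

    def idf_weighted_overlap(question, passage, doc_freq, n_docs):
        """Sum the IDF weights of question terms that also occur in the passage."""
        tokens = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
        shared = tokens(question) & tokens(passage)
        return sum(math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in shared)

    question = "He was presidentially pardoned on September 8, 1974"
    passage = "Ford pardoned Nixon on Sept. 8, 1974"
    doc_freq = {"pardoned": 20, "on": 600000, "8": 90000, "1974": 5000}
    print(idf_weighted_overlap(question, passage, doc_freq, 1_000_000))

Rare shared terms like "pardoned" dominate such a score, while very common ones like "on" contribute little.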


Another type of scorer uses knowledge in triple stores, simple reasoning such as subsumption and disjointness in type taxonomies, geospatial reasoning, and temporal reasoning. Geospatial reasoning is used in Watson to detect the presence or absence of spatial relations such as directionality, borders, and containment between geoentities. For example, if a question asks for an Asian city, then spatial containment provides evidence that Beijing is a suitable candidate, whereas Sydney is not. Similarly, geocoordinate information associated with entities is used to compute relative directionality (for example, California is SW of Montana; GW Bridge is N of Lincoln Tunnel, and so on).

Temporal reasoning is used in Watson to detect inconsistencies between dates in the clue and those associated with a candidate answer. For example, the two most likely candidate answers generated by the system for the clue, "In 1594 he took a job as a tax collector in Andalusia," are "Thoreau" and "Cervantes." In this case, temporal reasoning is used to rule out Thoreau as he was not alive in 1594, having been born in 1817, whereas Cervantes, the correct answer, was born in 1547 and died in 1616.
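That check amounts to testing whether a candidate's known lifespan can overlap the year in the clue; a simplified sketch follows (in practice the candidate dates would come from structured sources such as dbPedia):

    def temporally_consistent(clue_year, birth_year=None, death_year=None):
        """Return False when the clue's year falls outside the candidate's lifespan."""
        if birth_year is not None and clue_year < birth_year:
            return False
        if death_year is not None and clue_year > death_year:
            return False
        return True

    # The 1594 tax-collector clue: Thoreau (1817-1862) is ruled out,
    # Cervantes (1547-1616) is not.
    print(temporally_consistent(1594, 1817, 1862))  # False
    print(temporally_consistent(1594, 1547, 1616))  # True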

Each of the scorers implemented in Watson, how they work, how they interact, and their independent impact on Watson's performance deserves its own research paper. We cannot do this work justice here. It is important to note, however, that at this point no one algorithm dominates. In fact we believe DeepQA's facility for absorbing these algorithms, and the tools we have created for exploring their interactions and effects, will represent an important and lasting contribution of this work.

To help developers and users get a sense of how Watson uses evidence to decide between competing candidate answers, scores are combined into an overall evidence profile. The evidence profile groups individual features into aggregate evidence dimensions that provide a more intuitive view of the feature group. Aggregate evidence dimensions might include, for example, Taxonomic, Geospatial (location), Temporal, Source Reliability, Gender, Name Consistency, Relational, Passage Support, Theory Consistency, and so on. Each aggregate dimension is a combination of related feature scores produced by the specific algorithms that fired on the gathered evidence.
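A minimal sketch of that grouping step is shown below; the dimension names follow the text, but the individual feature names, the mapping, and the averaging rule are invented for illustration:

    # Hypothetical mapping from individual feature scores to aggregate dimensions.
    DIMENSIONS = {
        "Location": ["spatial_containment", "directionality"],
        "Passage Support": ["term_overlap", "logical_form_alignment"],
        "Popularity": ["candidate_frequency"],
        "Source Reliability": ["source_trust"],
    }

    def evidence_profile(feature_scores):
        """Collapse per-feature scores into one value per aggregate dimension."""
        profile = {}
        for dimension, features in DIMENSIONS.items():
            present = [feature_scores[f] for f in features if f in feature_scores]
            profile[dimension] = sum(present) / len(present) if present else 0.0
        return profile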

Consider the following question: "Chile shares its longest land border with this country." In figure 8 we see a comparison of the evidence profiles for two candidate answers produced by the system for this question: Argentina and Bolivia.


Figure 8. Evidence Profiles for Two Candidate Answers. Dimensions (Location, Passage Support, Popularity, Source Reliability, Taxonomic) are on the x-axis for the candidates Argentina and Bolivia, and relative strength is on the y-axis.


Simple search engine scores favor Bolivia as an answer, due to a popular border dispute that was frequently reported in the news. Watson prefers Argentina (the correct answer) over Bolivia, and the evidence profile shows why. Although Bolivia does have strong popularity scores, Argentina has strong support in the geospatial, passage support (for example, alignment and logical form graph matching of various text passages), and source reliability dimensions.
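The sketch below shows one way such a profile could be assembled: individual feature scores are bucketed into named dimensions and averaged. The feature names, the mapping to dimensions, and the use of a simple average are illustrative assumptions; in Watson the contribution of each feature is learned rather than fixed.

```python
from collections import defaultdict

# Hypothetical mapping from individual features to aggregate evidence dimensions.
DIMENSION_OF = {
    "geo_containment": "Location",
    "geo_direction": "Location",
    "passage_term_overlap": "Passage Support",
    "logical_form_alignment": "Passage Support",
    "popularity_prior": "Popularity",
    "source_rank": "Source Reliability",
    "type_coercion": "Taxonomic",
}

def evidence_profile(feature_scores):
    """Average related feature scores into per-dimension aggregates."""
    buckets = defaultdict(list)
    for feature, score in feature_scores.items():
        buckets[DIMENSION_OF.get(feature, "Other")].append(score)
    return {dim: sum(scores) / len(scores) for dim, scores in buckets.items()}

argentina = {"geo_containment": 0.9, "passage_term_overlap": 0.7,
             "logical_form_alignment": 0.8, "popularity_prior": 0.3, "source_rank": 0.8}
print(evidence_profile(argentina))
```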

    Final Merging and Ranking

It is one thing to return documents that contain key words from the question. It is quite another, however, to analyze the question and the content enough to identify the precise answer and yet another to determine an accurate enough confidence in its correctness to bet on it. Winning at Jeopardy requires exactly that ability.

The goal of final ranking and merging is to evaluate the hundreds of hypotheses based on potentially hundreds of thousands of scores to identify the single best-supported hypothesis given the evidence and to estimate its confidence: the likelihood it is correct.

    Answer Merging

Multiple candidate answers for a question may be equivalent despite very different surface forms. This is particularly confusing to ranking techniques that make use of relative differences between candidates. Without merging, ranking algorithms would be comparing multiple surface forms that represent the same answer and trying to discriminate among them. While one line of research has been proposed based on boosting confidence in similar candidates (Ko, Nyberg, and Luo 2007), our approach is inspired by the observation that different surface forms are often disparately supported in the evidence and result in radically different, though potentially complementary, scores. This motivates an approach that merges answer scores before ranking and confidence estimation. Using an ensemble of matching, normalization, and coreference resolution algorithms, Watson identifies equivalent and related hypotheses (for example, Abraham Lincoln and Honest Abe) and then enables custom merging per feature to combine scores.
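A minimal sketch of this merging step is shown below. It uses a hard-coded canonicalization table and keeps the maximum score per feature; Watson instead relies on an ensemble of matching, normalization, and coreference algorithms, and the per-feature merging policy is customizable.

```python
from collections import defaultdict

# Hypothetical canonicalization table standing in for matching, normalization,
# and coreference resolution over candidate surface forms.
CANONICAL = {"honest abe": "Abraham Lincoln", "abraham lincoln": "Abraham Lincoln"}

def merge_candidates(candidates):
    """Merge equivalent surface forms, keeping the max score per feature."""
    merged = defaultdict(dict)
    for surface, features in candidates:
        key = CANONICAL.get(surface.lower(), surface)
        for feature, score in features.items():
            merged[key][feature] = max(score, merged[key].get(feature, float("-inf")))
    return dict(merged)

candidates = [
    ("Abraham Lincoln", {"passage_support": 0.6, "type_match": 0.9}),
    ("Honest Abe",      {"passage_support": 0.8, "type_match": 0.2}),
]
print(merge_candidates(candidates))
```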

Ranking and Confidence Estimation

After merging, the system must rank the hypotheses and estimate confidence based on their merged scores. We adopted a machine-learning approach that requires running the system over a set of training questions with known answers and training a model based on the scores. One could assume a very flat model and apply existing ranking algorithms (for example, Herbrich, Graepel, and Obermayer [2000]; Joachims [2002]) directly to these score profiles and use the ranking score for confidence. For more intelligent ranking, however, ranking and confidence estimation may be separated into two phases. In both phases sets of scores may be grouped according to their domain (for example, type matching, passage scoring, and so on) and intermediate models trained using ground truths and methods specific for that task. Using these intermediate models, the system produces an ensemble of intermediate scores. Motivated by hierarchical techniques such as mixture of experts (Jacobs et al. 1991) and stacked generalization (Wolpert 1992), a metalearner is trained over this ensemble. This approach allows for iteratively enhancing the system with more sophisticated and deeper hierarchical models while retaining flexibility for robustness and experimentation as scorers are modified and added to the system.

Watson's metalearner uses multiple trained models to handle different question classes as, for instance, certain scores that may be crucial to identifying the correct answer for a factoid question may not be as useful on puzzle questions.

Finally, an important consideration in dealing with NLP-based scorers is that the features they produce may be quite sparse, and so accurate confidence estimation requires the application of confidence-weighted learning techniques (Dredze, Crammer, and Pereira 2008).
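The sketch below illustrates the two-phase idea, with scikit-learn logistic regression standing in for both the intermediate models and the metalearner. It omits details the text calls out, such as held-out training for the stacked layer, per-question-class models, and confidence-weighted learning, and the toy data is invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: rows are candidate answers, columns are raw feature scores
# grouped by domain (two type-matching features, two passage-scoring features).
X_type = np.array([[0.9, 0.8], [0.2, 0.1], [0.7, 0.6], [0.1, 0.3]])
X_passage = np.array([[0.8, 0.7], [0.3, 0.2], [0.4, 0.9], [0.2, 0.1]])
y = np.array([1, 0, 1, 0])  # 1 = correct answer

# Phase 1: intermediate models, one per feature group.
type_model = LogisticRegression().fit(X_type, y)
passage_model = LogisticRegression().fit(X_passage, y)

# Phase 2: a metalearner stacked over the intermediate scores.
meta_features = np.column_stack([
    type_model.predict_proba(X_type)[:, 1],
    passage_model.predict_proba(X_passage)[:, 1],
])
meta_model = LogisticRegression().fit(meta_features, y)

# Confidence estimate for a new candidate answer.
new_meta = np.column_stack([
    type_model.predict_proba([[0.8, 0.9]])[:, 1],
    passage_model.predict_proba([[0.7, 0.6]])[:, 1],
])
print(meta_model.predict_proba(new_meta)[:, 1])
```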

    Speed and Scaleout

DeepQA is developed using Apache UIMA,10 a framework implementation of the Unstructured Information Management Architecture (Ferrucci and Lally 2004). UIMA was designed to support interoperability and scaleout of text and multimodal analysis applications. All of the components in DeepQA are implemented as UIMA annotators. These are software components that analyze text and produce annotations or assertions about the text. Watson has evolved over time and the number of components in the system has reached into the hundreds. UIMA facilitated rapid component integration, testing, and evaluation.

Early implementations of Watson ran on a single processor, where it took 2 hours to answer a single question. The DeepQA computation is embarrassingly parallel, however. UIMA-AS, part of Apache UIMA, enables the scaleout of UIMA applications using asynchronous messaging. We used UIMA-AS to scale Watson out over 2500 compute cores. UIMA-AS handles all of the communication, messaging, and queue management necessary using the open JMS standard. The UIMA-AS deployment of Watson enabled competitive run-time latencies in the 3 to 5 second range.

To preprocess the corpus and create fast run-time indices we used Hadoop.11 UIMA annotators were easily deployed as mappers in the Hadoop map-reduce framework. Hadoop distributes the content over the cluster to afford high CPU utilization and provides convenient tools for deploying, managing, and monitoring the corpus analysis process.
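As a flavor of the map-reduce preprocessing pattern (not the actual UIMA-annotator deployment), a Hadoop Streaming-style mapper can be a plain script that reads documents on standard input and emits key-value pairs for a downstream indexing reducer:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming-style mapper sketch: each input line is assumed to be
# "doc_id<TAB>text"; the mapper emits (term, doc_id) pairs that a reducer could
# aggregate into an inverted index. Illustrative only.
import sys

for line in sys.stdin:
    doc_id, _, text = line.rstrip("\n").partition("\t")
    for term in text.lower().split():
        print(f"{term}\t{doc_id}")
```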

Strategy

Jeopardy demands strategic game play to match wits against the best human players. In a typical Jeopardy game, Watson faces the following strategic decisions: deciding whether to buzz in and attempt to answer a question, selecting squares from the board, and wagering on Daily Doubles and Final Jeopardy.

The workhorse of strategic decisions is the buzz-in decision, which is required for every non-Daily Double clue on the board. This is where DeepQA's ability to accurately estimate its confidence in its answer is critical, and Watson considers this confidence along with other game-state factors in making the final determination whether to buzz.
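The sketch below shows the shape of such a decision: compare the answer confidence against a threshold that shifts with the game state. The function, its parameters, and the specific adjustments are invented for illustration; Watson's actual policy comes from the game modeling and learning described below.

```python
def should_buzz(confidence, my_score, leader_score, clues_remaining,
                base_threshold=0.5):
    """Illustrative buzz-in policy: buzz when confidence clears a threshold
    that tightens or relaxes with the game situation (a sketch only)."""
    threshold = base_threshold
    if clues_remaining < 5 and my_score < leader_score:
        threshold -= 0.1   # behind late in the game: take more risk
    if my_score > 2 * leader_score:
        threshold += 0.1   # comfortable lead: protect it
    return confidence >= threshold

print(should_buzz(confidence=0.62, my_score=8000, leader_score=12000, clues_remaining=3))
```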

Another strategic decision, Final Jeopardy wagering, generally receives the most attention and analysis from those interested in game strategy, and there exists a growing catalogue of heuristics such as Clavin's Rule or the Two-Thirds Rule (Dupee 1998), as well as identification of those critical score boundaries at which particular strategies may be used (by no means does this make it easy or rote; despite this attention, we have found evidence that contestants still occasionally make irrational Final Jeopardy bets). Daily Double betting turns out to be less studied but just as challenging, since the player must consider opponents' scores and predict the likelihood of getting the question correct just as in Final Jeopardy. After a Daily Double, however, the game is not over, so evaluation of a wager requires forecasting the effect it will have on the distant, final outcome of the game.
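For flavor, the sketch below implements one widely cited Final Jeopardy heuristic for the leader, the cover bet: wager just enough to finish ahead even if second place doubles up. This is an illustrative textbook rule, not Watson's learned wagering strategy, and it ignores third place, critical score boundaries, and opponent modeling.

```python
def leader_cover_bet(leader_score, second_score):
    """Illustrative Final Jeopardy heuristic for the leader: wager just enough
    to finish one dollar ahead even if second place doubles its score."""
    needed = 2 * second_score + 1 - leader_score
    if needed <= 0:
        return 0  # a "locked" game: a zero wager already guarantees the win
    return needed

print(leader_cover_bet(leader_score=20000, second_score=13600))  # -> 7201
```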

These challenges drove the construction of statistical models of players and games, game-theoretic analyses of particular game scenarios and strategies, and the development and application of reinforcement-learning techniques for Watson to learn its strategy for playing Jeopardy. Fortunately, moderate amounts of historical data are available to serve as training data for learning techniques. Even so, it requires extremely careful modeling and game-theoretic evaluation as the game of Jeopardy has incomplete information and uncertainty to model, critical score boundaries to recognize, and savvy, competitive players to account for. It is a game where one faulty strategic choice can lose the entire match.

    Status and Results

After approximately 3 years of effort by a core algorithmic team composed of 20 researchers and software engineers with a range of backgrounds in natural language processing, information retrieval, machine learning, computational linguistics, and knowledge representation and reasoning, we have driven the performance of DeepQA to operate within the winner's cloud on the Jeopardy task, as shown in figure 9. Watson's results illustrated in this figure were measured over blind test sets containing more than 2,000 Jeopardy questions.

After many nonstarters, by the fourth quarter of 2007 we finally adopted the DeepQA architecture. At that point we had all moved out of our private offices and into a "war room" setting to dramatically facilitate team communication and tight collaboration. We instituted a host of disciplined engineering and experimental methodologies supported by metrics and tools to ensure we were investing in techniques that promised significant impact on end-to-end metrics. Since then, modulo some early jumps in performance, the progress has been incremental but steady. It is slowing in recent months as the remaining challenges prove either very difficult or highly specialized and covering small phenomena in the data.

By the end of 2008 we were performing reasonably well, about 70 percent precision at 70 percent attempted over the 12,000 question blind data, but it was taking 2 hours to answer a single question on a single CPU. We brought on a team specializing in UIMA and UIMA-AS to scale up DeepQA on a massively parallel high-performance computing platform. We are currently answering more than 85 percent of the questions in 5 seconds or less, fast enough to provide competitive performance, and with continued algorithmic development are performing with about 85 percent precision at 70 percent attempted.
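Precision at a given percent attempted can be read off a curve computed by ranking questions by answer confidence and sweeping a threshold, as in the sketch below (the confidence values and correctness labels are invented).

```python
def precision_at_attempted(answers):
    """Given (confidence, is_correct) pairs, return (fraction attempted, precision)
    points obtained by answering only the most confident questions first."""
    ranked = sorted(answers, key=lambda a: a[0], reverse=True)
    points, correct = [], 0
    for i, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        points.append((i / len(ranked), correct / i))
    return points

answers = [(0.95, True), (0.90, True), (0.72, False), (0.60, True), (0.20, False)]
for attempted, precision in precision_at_attempted(answers):
    print(f"{attempted:.0%} attempted -> {precision:.0%} precision")
```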

We have more to do in order to improve precision, confidence, and speed enough to compete with grand champions. We are finding great results in leveraging the DeepQA architecture capability to quickly admit and evaluate the impact of new algorithms as we engage more university partnerships to help meet the challenge.

    An Early Adaptation Experiment

Another challenge for DeepQA has been to demonstrate if and how it can adapt to other QA tasks. In mid-2008, after we had populated the basic architecture with a host of components for searching, evidence retrieval, scoring, final merging, and ranking for the Jeopardy task, IBM collaborated with CMU to try to adapt DeepQA to the TREC QA problem by plugging in only select domain-specific components previously tuned to the TREC task. In particular, we added question-analysis components from PIQUANT and OpenEphyra that identify answer types for a question, and candidate answer-generation components that identify instances of those answer types in the text. The DeepQA framework utilized both sets of components despite their different type systems; no ontology integration was performed. The identification and integration of these domain-specific components into DeepQA took just a few weeks.

The extended DeepQA system was applied to TREC questions. Some of DeepQA's answer and evidence scorers are more relevant in the TREC domain than in the Jeopardy domain and others are less relevant. We addressed this aspect of adaptation for DeepQA's final merging and ranking by training an answer-ranking model using TREC questions; thus the extent to which each score affected the answer ranking and confidence was automatically customized for TREC.

Figure 10 shows the results of the adaptation experiment. Both the 2005 PIQUANT and 2007 OpenEphyra systems had less than 50 percent accuracy on the TREC questions and less than 15 percent accuracy on the Jeopardy clues. The DeepQA system at the time had accuracy above 50 percent on Jeopardy. Without adaptation DeepQA's accuracy on TREC questions was about 35 percent. After adaptation, DeepQA's accuracy on TREC exceeded 60 percent. We repeated the adaptation experiment in 2010, and in addition to the improvements to DeepQA since 2008, the adaptation included a transfer learning step for TREC questions from a model trained on Jeopardy questions. DeepQA's performance on TREC data was 51 percent accuracy prior to adaptation and 67 percent after adaptation, nearly level with its performance on blind Jeopardy data.

The result performed significantly better than the original complete systems on the task for which they were designed. While just one adaptation experiment, this is exactly the sort of behavior we think an extensible QA system should exhibit. It should quickly absorb domain- or task-specific components and get better on that target task without degradation in performance in the general case or on prior tasks.


Figure 9. Watson's Precision and Confidence Progress as of the Fourth Quarter 2009. The plot shows precision (y-axis) against percent of questions answered (x-axis) for successive versions of the system: Baseline, v0.1 (12/07), v0.2 (05/08), v0.3 (08/08), v0.4 (12/08), v0.5 (05/09), v0.6 (10/09), and v0.7 (04/10).


    Summary

The Jeopardy Challenge helped us address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After 3 years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy quiz show.

Our results strongly suggest that DeepQA is an effective and extensible architecture that may be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of QA.

The architecture and methodology developed as part of this project have highlighted the need to take a systems-level approach to research in QA, and we believe this applies to research in the broader field of AI. We have developed many different algorithms for addressing different kinds of problems in QA and plan to publish many of them in more detail in the future. However, no one algorithm solves challenge problems like this. End-to-end systems tend to involve many complex and often overlapping interactions. A system design and methodology that facilitated the efficient integration and ablation studies of many probabilistic components was essential for our success to date. The impact of any one algorithm on end-to-end performance changed over time as other techniques were added and had overlapping effects. Our commitment to regularly evaluate the effects of specific techniques on end-to-end performance, and to let that shape our research investment, was necessary for our rapid progress.

Rapid experimentation was another critical ingredient to our success. The team conducted more than 5500 independent experiments in 3 years, each averaging about 2000 CPU hours and generating more than 10 GB of error-analysis data. Without DeepQA's massively parallel architecture and a dedicated high-performance computing infrastructure, we would not have been able to perform these experiments, and likely would not have even conceived of many of them.


Figure 10. Accuracy on Jeopardy! and TREC. The chart compares accuracy on the Jeopardy! and TREC tasks for IBM's 2005 TREC QA system (PIQUANT), CMU's 2007 TREC QA system (Ephyra), and DeepQA prior to adaptation and after adaptation (9/2008), when DeepQA performs on both tasks.


Tuned for the Jeopardy Challenge, Watson has begun to compete against former Jeopardy players in a series of "sparring" games. It is holding its own, winning 64 percent of the games, but has to be improved and sped up to compete favorably against the very best.

We have leveraged our collaboration with CMU and with our other university partnerships in getting this far and hope to continue our collaborative work to drive Watson to its final goal, and help openly advance QA research.

    Acknowledgements

We would like to acknowledge the talented team of research scientists and engineers at IBM and at partner universities, listed below, for the incredible work they are doing to influence and develop all aspects of Watson and the DeepQA architecture. It is this team who are responsible for the work described in this paper. From IBM, Andy Aaron, Einat Amitay, Branimir Boguraev, David Carmel, Arthur Ciccolo, Jaroslaw Cwiklik, Pablo Duboue, Edward Epstein, Raul Fernandez, Radu Florian, Dan Gruhl, Tong-Haing Fin, Achille Fokoue, Karen Ingraffea, Bhavani Iyer, Hiroshi Kanayama, Jon Lenchner, Anthony Levas, Burn Lewis, Michael McCord, Paul Morarescu, Matthew Mulholland, Yuan Ni, Miroslav Novak, Yue Pan, Siddharth Patwardhan, Zhao Ming Qiu, Salim Roukos, Marshall Schor, Dana Sheinwald, Roberto Sicconi, Hiroshi Kanayama, Kohichi Takeda, Gerry Tesauro, Chen Wang, Wlodek Zadrozny, and Lei Zhang. From our academic partners, Manas Pathak (CMU), Chang Wang (University of Massachusetts [UMass]), Hideki Shima (CMU), James Allen (UMass), Ed Hovy (University of Southern California/Information Sciences Institute), Bruce Porter (University of Texas), Pallika Kanani (UMass), Boris Katz (Massachusetts Institute of Technology), Alessandro Moschitti and Giuseppe Riccardi (University of Trento), Barbara Cutler, Jim Hendler, and Selmer Bringsjord (Rensselaer Polytechnic Institute).

Notes

1. Watson is named after IBM's founder, Thomas J. Watson.

2. Random jitter has been added to help visualize the distribution of points.

3. www-nlpir.nist.gov/projects/aquaint.

4. trec.nist.gov/proceedings/proceedings.html.

5. sourceforge.net/projects/openephyra/.

6. The dip at the left end of the light gray curve is due to the disproportionately high score the search engine assigns to short queries, which typically are not sufficiently discriminative to retrieve the correct answer in top position.

7. dbpedia.org/.

8. www.mpi-inf.mpg.de/yago-naga/yago/.

9. freebase.com/.

10. incubator.apache.org/uima/.

11. hadoop.apache.org/.

References

Chu-Carroll, J.; Czuba, K.; Prager, J. M.; and Ittycheriah, A. 2003. Two Heads Are Better Than One in Question-Answering. Paper presented at the Human Language Technology Conference, Edmonton, Canada, 27 May–1 June.

Dredze, M.; Crammer, K.; and Pereira, F. 2008. Confidence-Weighted Linear Classification. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML). Princeton, NJ: International Machine Learning Society.

Dupee, M. 1998. How to Get on Jeopardy! ... and Win: Valuable Information from a Champion. Secaucus, NJ: Citadel Press.

Ferrucci, D., and Lally, A. 2004. UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering 10(3–4): 327–348.

Ferrucci, D.; Nyberg, E.; Allan, J.; Barker, K.; Brown, E.; Chu-Carroll, J.; Ciccolo, A.; Duboue, P.; Fan, J.; Gondek, D.; Hovy, E.; Katz, B.; Lally, A.; McCord, M.; Morarescu, P.; Murdock, W.; Porter, B.; Prager, J.; Strzalkowski, T.; Welty, W.; and Zadrozny, W. 2009. Towards the Open Advancement of Question Answer Systems. IBM Technical Report RC24789, Yorktown Heights, NY.

Herbrich, R.; Graepel, T.; and Obermayer, K. 2000. Large Margin Rank Boundaries for Ordinal Regression. In Advances in Large Margin Classifiers, 115–132. Linköping, Sweden: Liu E-Press.

Hermjakob, U.; Hovy, E. H.; and Lin, C. 2000. Knowledge-Based Question Answering. In Proceedings of the Sixth World Multiconference on Systems, Cybernetics, and Informatics (SCI-2002). Winter Garden, FL: International Institute of Informatics and Systemics.

Hsu, F.-H. 2002. Behind Deep Blue: Building the Computer That Defeated the World Chess Champion. Princeton, NJ: Princeton University Press.

Jacobs, R.; Jordan, M. I.; Nowlan, S. J.; and Hinton, G. E. 1991. Adaptive Mixtures of Local Experts. Neural Computation 3(1): 79–87.

Joachims, T. 2002. Optimizing Search Engines Using Clickthrough Data. In Proceedings of the Thirteenth ACM Conference on Knowledge Discovery and Data Mining (KDD). New York: Association for Computing Machinery.

Ko, J.; Nyberg, E.; and Si, L. 2007. A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering. In Proceedings of the 30th Annual International ACM SIGIR Conference, 343–350. New York: Association for Computing Machinery.

Lenat, D. B. 1995. Cyc: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM 38(11): 33–38.

Maybury, M., ed. 2004. New Directions in Question-Answering. Menlo Park, CA: AAAI Press.

McCord, M. C. 1990. Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars. In Natural Language and Logic: International Scientific Symposium, Lecture Notes in Computer Science 459. Berlin: Springer Verlag.

Miller, G. A. 1995. WordNet: A Lexical Database for English. Communications of the ACM 38(11): 39–41.

Moldovan, D.; Clark, C.; Harabagiu, S.; and Maiorano, S. 2003. COGEX: A Logic Prover for Question Answering. Paper presented at the Human Language Technology Conference, Edmonton, Canada, 27 May–1 June.

Paritosh, P., and Forbus, K. 2005. Analysis of Strategic Knowledge in Back of the Envelope Reasoning. In Proceedings of the 20th AAAI Conference on Artificial Intelligence (AAAI-05). Menlo Park, CA: AAAI Press.

Prager, J. M.; Chu-Carroll, J.; and Czuba, K. 2004. A Multi-Strategy, Multi-Question Approach to Question Answering. In New Directions in Question-Answering, ed. M. Maybury. Menlo Park, CA: AAAI Press.

Simmons, R. F. 1970. Natural Language Question-Answering Systems: 1969. Communications of the ACM 13(1): 15–30.

Smith, T. F., and Waterman, M. S. 1981. Identification of Common Molecular Subsequences. Journal of Molecular Biology 147(1): 195–197.

Strzalkowski, T., and Harabagiu, S., eds. 2006. Advances in Open-Domain Question-Answering. Berlin: Springer.

Voorhees, E. M., and Dang, H. T. 2005. Overview of the TREC 2005 Question Answering Track. In Proceedings of the Fourteenth Text Retrieval Conference. Gaithersburg, MD: National Institute of Standards and Technology.

Wolpert, D. H. 1992. Stacked Generalization. Neural Networks 5(2): 241–259.

David Ferrucci is a research staff member and leads the Semantic Analysis and Integration department at the IBM T. J. Watson Research Center, Hawthorne, New York. Ferrucci is the principal investigator for the DeepQA/Watson project and the chief architect for UIMA, now an OASIS standard and Apache open-source project. Ferrucci's background is in artificial intelligence and software engineering.

Eric Brown is a research staff member at the IBM T. J. Watson Research Center. His background is in information retrieval. Brown's current research interests include question answering, unstructured information management architectures, and applications of advanced text analysis and question answering to information retrieval systems.

Jennifer Chu-Carroll is a research staff member at the IBM T. J. Watson Research Center. Chu-Carroll is on the editorial board of the Journal of Dialogue Systems, and previously served on the executive board of the North American Chapter of the Association for Computational Linguistics and as program cochair of HLT-NAACL 2006. Her research interests include question answering, semantic search, and natural language discourse and dialogue.

James Fan is a research staff member at the IBM T. J. Watson Research Center. His research interests include natural language processing, question answering, and knowledge representation and reasoning. He has served as a program committee member for several top-ranked AI conferences and journals, such as IJCAI and AAAI. He received his Ph.D. from the University of Texas at Austin in 2006.

David Gondek is a research staff member at the IBM T. J. Watson Research Center. His research interests include applications of machine learning, statistical modeling, and game theory to question answering and natural language processing. Gondek has contributed to journals and conferences in machine learning and data mining. He earned his Ph.D. in computer science from Brown University.

Aditya A. Kalyanpur is a research staff member at the IBM T. J. Watson Research Center. His primary research interests include knowledge representation and reasoning, natural language programming, and question answering. He has served on W3C working groups, as program cochair of an international semantic web workshop, and as a reviewer and program committee member for several AI journals and conferences. Kalyanpur completed his doctorate in AI and semantic web related research from the University of Maryland, College Park.

Adam Lally is a senior software engineer at IBM's T. J. Watson Research Center. He develops natural language processing and reasoning algorithms for a variety of applications and is focused on developing scalable frameworks of NLP and reasoning systems. He is a lead developer and designer for the UIMA framework and architecture specification.

J. William Murdock is a research staff member at the IBM T. J. Watson Research Center. Before joining IBM, he worked at the United States Naval Research Laboratory. His research interests include natural-language semantics, analogical reasoning, knowledge-based planning, machine learning, and computational reflection. In 2001, he earned his Ph.D. in computer science from the Georgia Institute of Technology.

Eric Nyberg is a professor at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Nyberg's research spans a broad range of text analysis and information retrieval areas, including question answering, search, reasoning, and natural language processing architectures, systems, and software engineering principles.

John Prager is a research staff member at the IBM T. J. Watson Research Center in Yorktown Heights, New York. His background includes natural-language based interfaces and semantic search, and his current interest is in incorporating user and domain models to inform question answering. He is a member of the TREC program committee.

Nico Schlaefer is a Ph.D. student at the Language Technologies Institute in the School of Computer Science, Carnegie Mellon University, and an IBM Ph.D. Fellow. His research focus is the application of machine learning techniques to natural language processing tasks. Schlaefer is the primary author of the OpenEphyra question answering system.

Chris Welty is a research staff member at the IBM Thomas J. Watson Research Center. His background is primarily in knowledge representation and reasoning. Welty's current research focus is on hybridization of machine learning, natural language processing, and knowledge representation and reasoning in building AI systems.
