TRANSCRIPT
AQUAINT Testbed
John Aberdeen, John Burger, Conrad Chang, John Henderson, Scott Mardis
The MITRE Corporation
© 2002, The MITRE Corporation
AQUAINT Activities @ MITRE

Testbed
- Provide access to Q&A systems on classified data.
- Solicit user feedback (user studies).

Testweb
- Provide public access to a broad variety of Q&A capabilities.
- Evaluate systems and architectures for inclusion in testbed.
User Studies
- Determine tasks for which Q&A is a useful technology.
- Determine advantages of Q&A over related information technologies.
- Obtain feedback from users on utility and usability.
Non-AQUAINT studies
AFIWC (Air Force Information Warfare Center)
- Installed both classified & unclassified systems
- QANDA configured to answer information security questions on BUGTRAC data
How can CDE be exploited?
IRS study
- Questions about completing tax forms
What is required to claim a child as a dependent?
Lessons learned
- Answers are only useful in context (long answers are preferred)
- Source text must be available
- Chapter and section headings are important context
Issues with a classified system
- May not be able to know the top-level objectives of the users
- May not be able to be told or record any actual questions
- Feedback is largely qualitative
Classified network (ICTESTNET)
- Access to users, data, scenarios will be restricted
Evaluate systems prior to installation
- Testweb becomes more important
- MITRE installations are more than a rehearsal
To facilitate feedback, the initial deployment should use open-source data, possibly on a different network
Testbed
Testbed Activity
MITRE installations (need to assess portability to the IC environment, maintainability, features, resources, etc.)
- QUIRK (CYCorp/IBM)
- Javelin (CMU), in progress
- Who's next?
- Support scenario development on CNS data with a search and Q/A interface.
- Centralize collection of user questions.
- Available to analysts, reservists, AQUAINT executive committee members, etc.
Testweb
- Clarity measure
- ISI TextMap integrated into web demo
- Soon:
  - CNS data: search + Google API
  - QANDA on CNS
[Architecture diagram: users interact with a Q/A portal/demo, which connects to Q/A systems (Javelin, LCC, Qanda, TextMap; TREC 2002), an IR service, a Clarity service, a Q/A repository, and the Google API, over the CNS data and other collections.]
System Interoperability
<?xml version="1.0"?>
<qaResult>
  <systemID>ABC</systemID>
  <systemMessage>ABC</systemMessage>
  <systemCode>0</systemCode>
  <question>When did Columbus discover America?</question>
  <response>
    <answer>
      <answerString exact="Y">1492</answerString>
      <answerString exact="Y" boundary="">the previous year</answerString>
      <justification type="context">Columbus discovered America in 1492
        <see>www.columbus.org</see>
      </justification>
    </answer>
  </response>
</qaResult>
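To give a feel for consuming this interchange format, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. The element names come from the sample above; the function name and the returned fields are illustrative assumptions.

import xml.etree.ElementTree as ET

def parse_qa_result(xml_text):
    # Parse one qaResult message and pull out the basics.
    root = ET.fromstring(xml_text)
    return {
        "system": root.findtext("systemID"),     # submitting system's ID
        "question": root.findtext("question"),   # original question text
        "answers": [a.text for a in root.iter("answerString")],
    }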
Answer Combination
67 systems submitted to TREC-11 main QA task
- Including some variants
Average raw submission accuracy was 22%
- 28% for loosely correct (judgment {1,2,3})
How well can we do by combining systems in some way?
- Simplest approach: voting (see the sketch below)
- More sophisticated approaches?
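As a baseline, plurality voting over the submitted answer strings takes only a few lines; this is an illustrative sketch, not the code used in these experiments.

from collections import Counter

def plurality_vote(answers):
    # Return the most frequently submitted answer string.
    return Counter(answers).most_common(1)[0][0]

print(plurality_vote(["1492", "1492", "1493"]))  # -> 1492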
Basic Approach
Define distance measures between answer pairs
- Generalization of simple voting
- Can use partial matches, other evidence sources
Select near-“centroid” of all submissions (see the sketch after this list)
- Minimize sum of pairwise distances (SOP)
- Previously used to select DNA sequences (Gusfield 1993)
Endless possibilities for distance measures
- Edit distance, ngrams, geo & time distances …
Also used document source prior
- NY Times vs. Associated Press vs. Xinhua vs. NIL
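A minimal sketch of SOP selection in Python, assuming a word-bag distance; the function names and the normalization are simplifications, and document-source priors are omitted.

from collections import Counter

def word_bag_distance(a, b):
    # 1 - (bag overlap / larger bag size); 0 when the bags are identical.
    ba, bb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ba & bb).values())
    denom = max(sum(ba.values()), sum(bb.values()))
    return 1.0 - overlap / denom if denom else 0.0

def select_centroid(answers):
    # Return (answer, SOP) for the submission that minimizes the sum of
    # pairwise distances to all submissions.
    scored = [(sum(word_bag_distance(c, o) for o in answers), c) for c in answers]
    sop, best = min(scored)
    return best, sop

With an exact-match (0/1) distance this reduces to simple voting; partial matches are what let near-duplicate answers reinforce each other.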
Sample Results
- Simple voting more than twice as good as average
- More sophisticated measures even better
- SOP scores can also be used for confidence ranking (see the sketch after the table)
                                        Dev Set (100 Qs)        Test Set (400 Qs)
                                        Strict      Loose       Strict
                                        P    avgP   P    avgP   P    avgP
exact string match                      50   70     54   74     42   65
word set                                54   75     58   78     46   68
word bag                                54   75     58   78     46   68
character set                           51   65     57   67     46   62
character bag                           60   81     64   85     50   74
word bag w/ doc priors                  66   83     74   88     51   72
character bag w/ doc priors             64   81     69   86     50   72
5-character bag w/ doc priors,
  weighted numeric strings              66   85     76   90     53   73
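A rough sketch of using SOP for confidence ranking, reusing select_centroid() from the sketch above; the input shape (a dict mapping each question to its submitted answers) is an assumption.

def rank_by_confidence(answers_by_question):
    # Lower SOP means tighter agreement among systems, i.e. higher confidence.
    picks = {q: select_centroid(ans) for q, ans in answers_by_question.items()}
    return sorted(picks.items(), key=lambda item: item[1][1])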
Example: Question 1674
What day did Neil Armstrong land on the moon?
22 different answer strings submitted
- 1969 (plurality of submissions—incorrect)
- July 20, 1969; on July 20, 1969 (correct)
- July 18, 1969; July 14, 1999 …
- 20
- Plus variants differing in punctuation
Best-scoring selector chooses the correct answer
- Above answers all contribute
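Running the select_centroid() sketch from earlier on an illustrative subset of the submitted strings shows how the date variants reinforce each other:

answers = ["1969", "1969", "July 20, 1969", "on July 20, 1969",
           "July 18, 1969", "July 14, 1999", "20"]
print(select_centroid(answers)[0])  # -> July 20, 1969, despite "1969" holding the plurality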
Future Work
Did not have access to
- System identity (even anonymized)
- Confidence rankings
Would like to use both
- Simple system-specific priors would be easy
- More sophisticated models possible
Better confidence estimation
- Should do better than using SOP score directly
Initial User Study: Comparison to Traditional IR
Establish a baseline for relative utility via a task-based comparison of Q/A to traditional IR.
Initial task: collect a set of geographic, temporal, and monetary facts regarding Hurricane Mitch
Data: TREC-11
Measures: task completeness, accuracy, time
Analyze logs for query reformulations, document usage, etc.
Preliminary Results
Initial subjects are MITRE employees
We have run N subjects each on Q/A and IR (Lucene)
<results not ready at time of submission>
What’s Next
Testbed system appraisals
Testweb stability & facelift
Studies with other Q/A systems & features
Other tasks (based on CNS data)
Other component integrations:
- Answer combination & summarization