TRANSCRIPT
AQUAINT Testbed
John Aberdeen, John Burger, Conrad Chang, John Henderson, Scott Mardis
The MITRE Corporation
© 2002, The MITRE Corporation
AQUAINT Activities @ MITRE

Testbed
- Provide access to Q&A systems on classified data.
- Solicit user feedback (user studies).

Testweb
- Provide public access to a broad variety of Q&A capabilities.
- Evaluate systems and architectures for inclusion in testbed.
User Studies
- Determine tasks for which Q&A is a useful technology.
- Determine advantages of Q&A over related information technologies.
- Obtain feedback from users on utility and usability.
Non-AQUAINT studies
AFIWC (Air Force Information Warfare Center)
- Installed both classified & unclassified systems
- QANDA configured to answer information security questions on BUGTRAC data
How can CDE be exploited?
IRS study
- Questions about completing tax forms
What is required to claim a child as a dependent?
Lessons learned
- Answers are only useful in context (long answers are preferred)
- Source text must be available
- Chapter and section headings are important context
Issues with a classified system
- May not be able to know the top-level objectives of the users
- May not be able to be told or record any actual questions
- Feedback is largely qualitative
Classified network (ICTESTNET)
- Access to users, data, scenarios will be restricted
Evaluate systems prior to installation
- Testweb becomes more important
- MITRE installations are more than a rehearsal
To facilitate feedback, the initial deployment should use open-source data, possibly on a different network
Testbed
Testbed Activity
MITRE installations (need to assess portability to the IC environment, maintainability, features, resources, etc.)
- QUIRK (CYCorp/IBM)
- Javelin (CMU), in progress
- Who's next?
- Support scenario development on CNS data with a search and Q/A interface.
- Centralize collection of user questions.
- Available to analysts, reservists, AQUAINT executive committee members, etc.
Testweb
- Clarity measure
- ISI TextMap integrated into web demo
- Soon:
  - CNS data: search + Google API
  - QANDA on CNS
[Architecture diagram: users interact with a Q/A portal/demo, which connects to Q/A systems (Javelin, LCC, Qanda, TextMap; TREC 2002), an IR service, a Clarity service, a Q/A repository, and the Google API, over the CNS data and other collections.]
System Interoperability
<?xml version="1.0"?>
<qaResult>
  <systemID>ABC</systemID>
  <systemMessage>ABC</systemMessage>
  <systemCode>0</systemCode>
  <question>When did Columbus discover America?</question>
  <response>
    <answer>
      <answerString exact="Y">1492</answerString>
      <answerString exact="Y" boundary="">the previous year</answerString>
      <justification type="context">Columbus discovered America in 1492
        <see>www.columbus.org</see>
      </justification>
    </answer>
  </response>
</qaResult>
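To give a feel for consuming this interchange format, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. The element names come from the sample above; the function name and the returned fields are illustrative assumptions.

import xml.etree.ElementTree as ET

def parse_qa_result(xml_text):
    # Parse one qaResult message and pull out the basics.
    root = ET.fromstring(xml_text)
    return {
        "system": root.findtext("systemID"),     # submitting system's ID
        "question": root.findtext("question"),   # original question text
        "answers": [a.text for a in root.iter("answerString")],
    }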
Answer Combination
67 systems submitted to TREC-11 main QA task
- Including some variants
Average raw submission accuracy was 22%
- 28% for loosely correct (judgment {1,2,3})
How well can we do by combining systems in some way?
- Simplest approach: voting (see the sketch below)
- More sophisticated approaches?
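As a baseline, plurality voting over the submitted answer strings takes only a few lines; this is an illustrative sketch, not the code used in these experiments.

from collections import Counter

def plurality_vote(answers):
    # Return the most frequently submitted answer string.
    return Counter(answers).most_common(1)[0][0]

print(plurality_vote(["1492", "1492", "1493"]))  # -> 1492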
Basic Approach
Define distance measures between answer pairs
- Generalization of simple voting
- Can use partial matches, other evidence sources
Select near-“centroid” of all submissions (see the sketch after this list)
- Minimize sum of pairwise distances (SOP)
- Previously used to select DNA sequences (Gusfield 1993)
Endless possibilities for distance measures
- Edit distance, ngrams, geo & time distances …
Also used document source prior
- NY Times vs. Associated Press vs. Xinhua vs. NIL
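A minimal sketch of SOP selection in Python, assuming a word-bag distance; the function names and the normalization are simplifications, and document-source priors are omitted.

from collections import Counter

def word_bag_distance(a, b):
    # 1 - (bag overlap / larger bag size); 0 when the bags are identical.
    ba, bb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ba & bb).values())
    denom = max(sum(ba.values()), sum(bb.values()))
    return 1.0 - overlap / denom if denom else 0.0

def select_centroid(answers):
    # Return (answer, SOP) for the submission that minimizes the sum of
    # pairwise distances to all submissions.
    scored = [(sum(word_bag_distance(c, o) for o in answers), c) for c in answers]
    sop, best = min(scored)
    return best, sop

With an exact-match (0/1) distance this reduces to simple voting; partial matches are what let near-duplicate answers reinforce each other.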
Sample Results
- Simple voting more than twice as good as average
- More sophisticated measures even better
- SOP scores can also be used for confidence ranking (see the sketch after the table)
                                        Dev Set (100 Qs)        Test Set (400 Qs)
                                        Strict      Loose       Strict
                                        P    avgP   P    avgP   P    avgP
exact string match                      50   70     54   74     42   65
word set                                54   75     58   78     46   68
word bag                                54   75     58   78     46   68
character set                           51   65     57   67     46   62
character bag                           60   81     64   85     50   74
word bag w/ doc priors                  66   83     74   88     51   72
character bag w/ doc priors             64   81     69   86     50   72
5-character bag w/ doc priors,
  weighted numeric strings              66   85     76   90     53   73
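A rough sketch of using SOP for confidence ranking, reusing select_centroid() from the sketch above; the input shape (a dict mapping each question to its submitted answers) is an assumption.

def rank_by_confidence(answers_by_question):
    # Lower SOP means tighter agreement among systems, i.e. higher confidence.
    picks = {q: select_centroid(ans) for q, ans in answers_by_question.items()}
    return sorted(picks.items(), key=lambda item: item[1][1])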
Example: Question 1674
What day did Neil Armstrong land on the moon?
22 different answer strings submitted
- 1969 (plurality of submissions—incorrect)
- July 20, 1969; on July 20, 1969 (correct)
- July 18, 1969; July 14, 1999 …
- 20
- Plus variants differing in punctuation
Best-scoring selector chooses the correct answer
- Above answers all contribute
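Running the select_centroid() sketch from earlier on an illustrative subset of the submitted strings shows how the date variants reinforce each other:

answers = ["1969", "1969", "July 20, 1969", "on July 20, 1969",
           "July 18, 1969", "July 14, 1999", "20"]
print(select_centroid(answers)[0])  # -> July 20, 1969, despite "1969" holding the plurality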
Future Work
Did not have access to
- System identity (even anonymized)
- Confidence rankings
Would like to use both
- Simple system-specific priors would be easy
- More sophisticated models possible
Better confidence estimation
- Should do better than using SOP score directly
Initial User Study: Comparison to Traditional IR
Establish a baseline for relative utility via a task-based comparison of Q/A to traditional IR.
Initial task: collect a set of geographic, temporal, and monetary facts regarding Hurricane Mitch
Data: TREC-11
Measures: task completeness, accuracy, time
Analyze logs for query reformulations, document usage, etc.
Preliminary Results
Initial subjects are MITRE employees
We have run N subjects each on Q/A and IR (Lucene)
<results not ready at time of submission>
What’s Next
Testbed system appraisals
Testweb stability & facelift
Studies with other Q/A systems & features
Other tasks (based on CNS data)
Other component integrations:
- Answer combination & summarization