6, then q receives a cluster adjustment score in order to boost its ranking within its QUAB cluster. We calculate the cluster adjustment score as a function of Δ, where Δ represents the difference in rank between the centroid of the cluster and the previous rank of the QUAB question q.

In the currently-implemented version of FERRET, we used Similarity Metric 5 to automatically identify the set of 10 QUAB questions that were most similar to a user's question. These question-and-answer pairs were then returned to the user – along with answers from FERRET's automatic Q/A system – as potential continuations of the Q/A dialogue. We used the remaining 6 similarity metrics described in Section 4 for comparison.

To evaluate the quality of these dialogues, we have conducted three sets of experiments with human users of FERRET. In these experiments, users were allotted two hours to interact with FERRET to gather information requested by a dialogue scenario similar to the one presented in Figure 2. In Experiment 1 (E1), 8 U.S. Navy Reserve (USNR) intelligence analysts used FERRET to research 8 different scenarios related to chemical and biological weapons. Experiment 2 and Experiment 3 considered several of the same scenarios addressed in E1: E2 included 24 mixed teams of analysts and novice users working with 2 scenarios, while E3 featured 4 USNR analysts working with 6 of the original 8 scenarios. (Details for each experiment are provided in Table 2.) Users were also given a task to focus their research; in E1 and E3, users prepared a short report detailing their findings; in E2, users were given a list of "challenge" questions to answer.

Exp | Users | QUABs? | Scenarios | Topics
E1  | 8     | Yes    | 8         | Egypt BW, Russia CW, South Africa CW, India CW, North Korea CBW, Pakistan CW, Libya CW, Iran CW
E2  | 24    | Yes    | 2         | Egypt BW, Russia CW
E3  | 4     | No     | 6         | Egypt BW, Russia CW, North Korea CBW, Pakistan CW, India CW, Libya CW, Iran CW

Table 2: Experiment details

Country   | n  | QUAB (avg.) | User Q (avg.) | Total (avg.)
India     | 2  | 21.5        | 13.0          | 34.5
Libya     | 2  | 12.0        | 9.0           | 21.0
Iran      | 2  | 18.5        | 11.0          | 29.5
N. Korea  | 2  | 16.5        | 7.5           | 24.0
Pakistan  | 2  | 29.5        | 15.5          | 45.0
S. Africa | 2  | 14.5        | 6.0           | 20.5
Russia    | 2  | 13.5        | 15.5          | 29.0
Egypt     | 2  | 15.0        | 20.5          | 35.5
TOTAL(E1) | 16 | 17.63       | 12.25         | 29.88

Table 4: Efficiency of Dialogues in Experiment 1

Country   | n  | QUAB (avg.) | User Q (avg.) | Total (avg.)
Russia    | 24 | 8.2         | 5.5           | 13.7
Egypt     | 24 | 10.8        | 7.6           | 18.4
TOTAL(E2) | 48 | 9.50        | 6.55          | 16.05

Table 5: Efficiency of Dialogues in Experiment 2

In E1 and E2, users had access to a total of 3210 QUAB questions that had been hand-created by developers for each of the 8 dialogue scenarios. (Table 3 provides totals for each scenario.) In E3, users performed research with a version of FERRET that included no QUABs at all.
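Similarity Metric 5 itself relies on conceptual hierarchies and semantic recognizers that are not reproduced here; the shape of the retrieval step, however, can be sketched with an ordinary bag-of-words cosine similarity standing in for the metric. All names and the term-frequency representation below are illustrative assumptions, not FERRET's implementation:

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words term frequencies for a question."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_quabs(user_question, quab_pairs, k=10):
    """Return the k QUAB (question, answer) pairs whose questions
    are most similar to the user's question."""
    uq = bow(user_question)
    ranked = sorted(quab_pairs,
                    key=lambda qa: cosine(uq, bow(qa[0])),
                    reverse=True)
    return ranked[:k]
```

Swapping `cosine` for a semantically informed metric changes only the `key` function; the top-k selection over the QUAB collection stays the same.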
Scenario      | Handcrafted QUABs
India         | 460
Libya         | 414
Iran          | 522
North Korea   | 316
Pakistan      | 322
South Africa  | 454
Russia        | 366
Egypt         | 356
Testing Total | 3210

Table 3: QUAB distribution over scenarios

We have evaluated FERRET by measuring efficiency, effectiveness, and user satisfaction:

Efficiency: FERRET's QUAB collection enabled users in our experiments to find more relevant information by asking fewer questions. When manually-created QUABs were available (E1 and E2), users submitted an average of 12.25 questions each session. When no QUABs were available (E3), users entered an average of 44.5 questions per session. Table 4 lists the number of QUAB question-answer pairs selected by users and the number of user questions entered by users during the 8 scenarios considered in E1. In E2, freed from the task of writing a research report, users asked significantly (p < 0.05) fewer questions and selected fewer QUABs than they did in E1. (See Table 5.)

Effectiveness: QUAB question-answer pairs also improved the overall accuracy of the answers returned by FERRET. To measure the effectiveness of a Q/A dialogue, human annotators were used to perform a post-hoc analysis of how relevant the QUAB pairs returned by FERRET were to each question entered by a user: each QUAB pair returned was graded as "relevant" or "irrelevant" to a user question in a forced-choice task. Aggregate relevance scores were used to calculate (1) the percentage of relevant QUAB pairs returned and (2) the mean reciprocal rank (MRR) for each user question. MRR is defined as MRR = (1/N) * Σ_{i=1}^{N} 1/r_i, where r_i is the lowest rank of any relevant answer for the i-th user query.[7] Table 6 describes the performance of FERRET when each of the 7 similarity measures presented in Section 4 is used to return QUAB pairs in response to a query. When only answers from FERRET's automatic Q/A system were available to users, only 15.7% of system responses were deemed to be relevant to a user's query. In contrast, when manually-generated QUAB pairs were introduced, as many as 84% of the system's responses were deemed to be relevant. The results listed in Table 6 show that the best metric is Similarity Metric 5. These results suggest that the selection of relevant questions depends on sophisticated similarity measures that rely on conceptual hierarchies and semantic recognizers.

[7] We chose MRR as our scoring metric because it reflects the fact that a user is most likely to examine the first few answers from any system, but that all correct answers returned by the system have some value, because users will sometimes examine a very large list of query results.

We evaluated the quality of each of the four sets of automatically-generated QUABs in a similar fashion. For each question submitted by a user in E1, E2, and E3, we collected the top 5 QUAB question-answer pairs (as determined by Similarity Metric 5) that FERRET returned.
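Both aggregate scores from the forced-choice evaluation can be computed directly from per-question judgment lists. In this minimal sketch (function names and data are illustrative, not drawn from the experiments), each user question maps to an ordered list of booleans marking whether the QUAB pair returned at that rank was judged relevant:

```python
def percent_relevant(judgments):
    """(1) Percentage of returned QUAB pairs judged relevant,
    pooled over all user questions."""
    flat = [r for ranked in judgments for r in ranked]
    return 100.0 * sum(flat) / len(flat)

def mean_reciprocal_rank(judgments):
    """(2) MRR = (1/N) * sum(1/r_i), where r_i is the lowest
    (i.e. best) rank at which a relevant answer appears for
    query i; a query with no relevant answer contributes 0."""
    total = 0.0
    for ranked in judgments:
        for rank, relevant in enumerate(ranked, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(judgments)
```

For example, the judgments [[True, False], [False, True], [False, False]] yield 33.3% relevant pairs and an MRR of (1/1 + 1/2 + 0)/3 = 0.5.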
             | % of Top 5 Responses Relevant to User Q | % of Top 1 Responses Relevant to User Q | MRR
Without QUAB | 15.73% | 26.85% | 0.325
Similarity 1 | 82.61% | 60.63% | 0.703
Similarity 2 | 79.95% | 58.45% | 0.681
Similarity 3 | 79.47% | 56.04% | 0.664
Similarity 4 | 78.26% | 46.14% | 0.592
Similarity 5 | 84.06% | 68.36% | 0.753
Similarity 6 | 81.64% | 56.04% | 0.671
Similarity 7 | 84.54% | 64.01% | 0.730

Table 6: Effectiveness of dialogues
As with the manually-generated QUABs, the automatically-generated pairs were submitted to human assessors who annotated each as "relevant" or "irrelevant" to the user's query. Aggregate scores are presented in Table 7.

Approach   | Egypt: % of Top 5 Responses Rel. to User Q | Egypt: MRR | Russia: % of Top 5 Responses Rel. to User Q | Russia: MRR
Approach 1 | 40.01% | 0.295 | 60.25% | 0.310
Approach 2 | 36.00% | 0.243 | 72.00% | 0.475
Approach 3 | 44.62% | 0.271 | 60.00% | 0.297
Approach 4 | 68.05% | 0.510 | 68.00% | 0.406

Table 7: Quality of QUABs acquired automatically

User Satisfaction: Users were consistently satisfied with their interactions with FERRET. In all three experiments, respondents reported that FERRET (1) gave meaningful answers, (2) provided useful suggestions, (3) helped answer specific questions, and (4) promoted their general understanding of the issues considered in the scenario. Complete results of this study are presented in Table 8.[8]

Factor                         | E1   | E2   | E3
Promoted understanding         | 3.40 | 3.20 | 3.75
Helped with specific questions | 3.70 | 3.60 | 3.25
Made good use of questions     | 3.40 | 3.55 | 3.00
Gave new scenario insights     | 3.00 | 3.10 | 2.20
Gave good collection coverage  | 3.75 | 3.70 | 3.75
Stimulated user thinking       | 3.50 | 3.20 | 2.75
Easy to use                    | 3.50 | 3.55 | 4.10
Expanded understanding         | 3.40 | 3.20 | 3.00
Gave meaningful answers        | 4.10 | 3.60 | 2.75
Was helpful                    | 4.00 | 3.75 | 3.25
Helped with new search methods | 2.75 | 3.05 | 2.25
Provided novel suggestions     | 3.25 | 3.40 | 2.65
Is ready for work environment  | 2.85 | 2.80 | 3.25
Would speed up work            | 3.25 | 3.25 | 3.00
Overall liking of system       | 3.75 | 3.60 | 3.75

Table 8: User Satisfaction Survey Results

[8] Evaluation scale: 1 = does not describe the system; 5 = completely describes the system.

6 Conclusions

We believe that the quality of Q/A interactions depends on the modeling of scenario topics. An ideal model is provided by question-answer databases (QUABs) that are created off-line and then used to make suggestions to a user of potentially relevant continuations of a discourse. In this paper, we have presented FERRET, an interactive Q/A system which makes use of a novel Q/A architecture that integrates QUAB question-answer pairs into the processing of questions. Experiments with FERRET have shown that, in addition to being rapidly adopted by users as valid suggestions, the incorporation of QUABs into Q/A can greatly improve the overall accuracy of an interactive Q/A dialogue.