6, then q receives a cluster adjustment score in order to boost its ranking within its QUAB cluster. We calculate the cluster adjustment score as a function of Δ, where Δ represents the difference in rank between the centroid of the cluster and the previous rank of the QUAB question q.

In the currently-implemented version of FERRET, we used Similarity Metric 5 to automatically identify the set of 10 QUAB questions that were most similar to a user's question. These question-and-answer pairs were then returned to the user – along with answers from FERRET's automatic Q/A system – as potential continuations of the Q/A dialogue. We used the remaining 6 similarity metrics described in Section 4 for comparison.

To evaluate the quality of these dialogues, we have conducted three sets of experiments with human users of FERRET. In these experiments, users were allotted two hours to interact with FERRET to gather information requested by a dialogue scenario similar to the one presented in Figure 2. In Experiment 1 (E1), 8 U.S. Navy Reserve (USNR) intelligence analysts used FERRET to research 8 different scenarios related to chemical and biological weapons. Experiment 2 and Experiment 3 considered several of the same scenarios addressed in E1: E2 included 24 mixed teams of analysts and novice users working with 2 scenarios, while E3 featured 4 USNR analysts working with 6 of the original 8 scenarios. (Details for each experiment are provided in Table 2.) Users were also given a task to focus their research; in E1 and E3, users prepared a short report detailing their findings; in E2, users were given a list of "challenge" questions to answer.

Exp | Users | QUABs? | Scenarios | Topics
E1  | 8     | Yes    | 8         | Egypt BW, Russia CW, South Africa CW, India CW, North Korea CBW, Pakistan CW, Libya CW, Iran CW
E2  | 24    | Yes    | 2         | Egypt BW, Russia CW
E3  | 4     | No     | 6         | Egypt BW, Russia CW, North Korea CBW, Pakistan CW, India CW, Libya CW, Iran CW

Table 2: Experiment details

Country   | n  | QUAB (avg.) | User Q (avg.) | Total (avg.)
India     | 2  | 21.5        | 13.0          | 34.5
Libya     | 2  | 12.0        | 9.0           | 21.0
Iran      | 2  | 18.5        | 11.0          | 29.5
N. Korea  | 2  | 16.5        | 7.5           | 24.0
Pakistan  | 2  | 29.5        | 15.5          | 45.0
S. Africa | 2  | 14.5        | 6.0           | 20.5
Russia    | 2  | 13.5        | 15.5          | 29.0
Egypt     | 2  | 15.0        | 20.5          | 35.5
TOTAL(E1) | 16 | 17.63       | 12.25         | 29.88

Table 4: Efficiency of Dialogues in Experiment 1

Country   | n  | QUAB (avg.) | User Q (avg.) | Total (avg.)
Russia    | 24 | 8.2         | 5.5           | 13.7
Egypt     | 24 | 10.8        | 7.6           | 18.4
TOTAL(E2) | 48 | 9.50        | 6.55          | 16.05

Table 5: Efficiency of Dialogues in Experiment 2

In E1 and E2, users had access to a total of 3210 QUAB questions that had been hand-created by developers for each of the 8 dialogue scenarios. (Table 3 provides totals for each scenario.) In E3, users performed research with a version of FERRET that included no QUABs at all.
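Similarity Metric 5 itself relies on conceptual hierarchies and semantic recognizers that are not reproduced here; the shape of the retrieval step, however, can be sketched with an ordinary bag-of-words cosine similarity standing in for the metric. All names and the term-frequency representation below are illustrative assumptions, not FERRET's implementation:

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words term frequencies for a question."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_quabs(user_question, quab_pairs, k=10):
    """Return the k QUAB (question, answer) pairs whose questions
    are most similar to the user's question."""
    uq = bow(user_question)
    ranked = sorted(quab_pairs,
                    key=lambda qa: cosine(uq, bow(qa[0])),
                    reverse=True)
    return ranked[:k]
```

Swapping `cosine` for a semantically informed metric changes only the `key` function; the top-k selection over the QUAB collection stays the same.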
Scenario      | Handcrafted QUABs
India         | 460
Libya         | 414
Iran          | 522
North Korea   | 316
Pakistan      | 322
South Africa  | 454
Russia        | 366
Egypt         | 356
Testing Total | 3210

Table 3: QUAB distribution over scenarios

We have evaluated FERRET by measuring efficiency, effectiveness, and user satisfaction:

Efficiency: FERRET's QUAB collection enabled users in our experiments to find more relevant information by asking fewer questions. When manually-created QUABs were available (E1 and E2), users submitted an average of 12.25 questions each session. When no QUABs were available (E3), users entered an average of 44.5 questions per session. Table 4 lists the number of QUAB question-answer pairs selected by users and the number of user questions entered by users during the 8 scenarios considered in E1. In E2, freed from the task of writing a research report, users asked significantly (p < 0.05) fewer questions and selected fewer QUABs than they did in E1. (See Table 5.)

Effectiveness: QUAB question-answer pairs also improved the overall accuracy of the answers returned by FERRET. To measure the effectiveness of a Q/A dialogue, human annotators were used to perform a post-hoc analysis of how relevant the QUAB pairs returned by FERRET were to each question entered by a user: each QUAB pair returned was graded as "relevant" or "irrelevant" to a user question in a forced-choice task. Aggregate relevance scores were used to calculate (1) the percentage of relevant QUAB pairs returned and (2) the mean reciprocal rank (MRR) for each user question. MRR is defined as MRR = (1/N) * Σ_{i=1}^{N} 1/r_i, where r_i is the lowest rank of any relevant answer for the i-th user query.[7] Table 6 describes the performance of FERRET when each of the 7 similarity measures presented in Section 4 is used to return QUAB pairs in response to a query. When only answers from FERRET's automatic Q/A system were available to users, only 15.7% of system responses were deemed to be relevant to a user's query. In contrast, when manually-generated QUAB pairs were introduced, as many as 84% of the system's responses were deemed to be relevant. The results listed in Table 6 show that the best metric is Similarity Metric 5. These results suggest that the selection of relevant questions depends on sophisticated similarity measures that rely on conceptual hierarchies and semantic recognizers.

[7] We chose MRR as our scoring metric because it reflects the fact that a user is most likely to examine the first few answers from any system, but that all correct answers returned by the system have some value, because users will sometimes examine a very large list of query results.

We evaluated the quality of each of the four sets of automatically-generated QUABs in a similar fashion. For each question submitted by a user in E1, E2, and E3, we collected the top 5 QUAB question-answer pairs (as determined by Similarity Metric 5) that FERRET returned.
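Both aggregate scores from the forced-choice evaluation can be computed directly from per-question judgment lists. In this minimal sketch (function names and data are illustrative, not drawn from the experiments), each user question maps to an ordered list of booleans marking whether the QUAB pair returned at that rank was judged relevant:

```python
def percent_relevant(judgments):
    """(1) Percentage of returned QUAB pairs judged relevant,
    pooled over all user questions."""
    flat = [r for ranked in judgments for r in ranked]
    return 100.0 * sum(flat) / len(flat)

def mean_reciprocal_rank(judgments):
    """(2) MRR = (1/N) * sum(1/r_i), where r_i is the lowest
    (i.e. best) rank at which a relevant answer appears for
    query i; a query with no relevant answer contributes 0."""
    total = 0.0
    for ranked in judgments:
        for rank, relevant in enumerate(ranked, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(judgments)
```

For example, the judgments [[True, False], [False, True], [False, False]] yield 33.3% relevant pairs and an MRR of (1/1 + 1/2 + 0)/3 = 0.5.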
             | % of Top 5 Responses Relevant to User Q | % of Top 1 Responses Relevant to User Q | MRR
Without QUAB | 15.73% | 26.85% | 0.325
Similarity 1 | 82.61% | 60.63% | 0.703
Similarity 2 | 79.95% | 58.45% | 0.681
Similarity 3 | 79.47% | 56.04% | 0.664
Similarity 4 | 78.26% | 46.14% | 0.592
Similarity 5 | 84.06% | 68.36% | 0.753
Similarity 6 | 81.64% | 56.04% | 0.671
Similarity 7 | 84.54% | 64.01% | 0.730

Table 6: Effectiveness of dialogues
As with the manually-generated QUABs, the automatically-generated pairs were submitted to human assessors who annotated each as "relevant" or "irrelevant" to the user's query. Aggregate scores are presented in Table 7.

Approach   | Egypt: % of Top 5 Responses Rel. to User Q | Egypt: MRR | Russia: % of Top 5 Responses Rel. to User Q | Russia: MRR
Approach 1 | 40.01% | 0.295 | 60.25% | 0.310
Approach 2 | 36.00% | 0.243 | 72.00% | 0.475
Approach 3 | 44.62% | 0.271 | 60.00% | 0.297
Approach 4 | 68.05% | 0.510 | 68.00% | 0.406

Table 7: Quality of QUABs acquired automatically

User Satisfaction: Users were consistently satisfied with their interactions with FERRET. In all three experiments, respondents reported that FERRET (1) gave meaningful answers, (2) provided useful suggestions, (3) helped answer specific questions, and (4) promoted their general understanding of the issues considered in the scenario. Complete results of this study are presented in Table 8.[8]

Factor                         | E1   | E2   | E3
Promoted understanding         | 3.40 | 3.20 | 3.75
Helped with specific questions | 3.70 | 3.60 | 3.25
Made good use of questions     | 3.40 | 3.55 | 3.00
Gave new scenario insights     | 3.00 | 3.10 | 2.20
Gave good collection coverage  | 3.75 | 3.70 | 3.75
Stimulated user thinking       | 3.50 | 3.20 | 2.75
Easy to use                    | 3.50 | 3.55 | 4.10
Expanded understanding         | 3.40 | 3.20 | 3.00
Gave meaningful answers        | 4.10 | 3.60 | 2.75
Was helpful                    | 4.00 | 3.75 | 3.25
Helped with new search methods | 2.75 | 3.05 | 2.25
Provided novel suggestions     | 3.25 | 3.40 | 2.65
Is ready for work environment  | 2.85 | 2.80 | 3.25
Would speed up work            | 3.25 | 3.25 | 3.00
Overall liking of system       | 3.75 | 3.60 | 3.75

Table 8: User Satisfaction Survey Results

[8] Evaluation scale: 1 = does not describe the system; 5 = completely describes the system.

6 Conclusions

We believe that the quality of Q/A interactions depends on the modeling of scenario topics. An ideal model is provided by question-answer databases (QUABs) that are created off-line and then used to make suggestions to a user of potentially relevant continuations of a discourse. In this paper, we have presented FERRET, an interactive Q/A system which makes use of a novel Q/A architecture that integrates QUAB question-answer pairs into the processing of questions. Experiments with FERRET have shown that, in addition to being rapidly adopted by users as valid suggestions, the incorporation of QUABs into Q/A can greatly improve the overall accuracy of an interactive Q/A dialogue.