
6, then q receives a cluster adjustment score in order to boost its ranking within its QUAB cluster. We calculate the cluster adjustment score as a function of Δ, where Δ represents the difference in rank between the centroid of the cluster and the previous rank of the QUAB question q.

In the currently-implemented version of FERRET, we used Similarity Metric 5 to automatically identify the set of 10 QUAB questions that were most similar to a user's question. These question-and-answer pairs were then returned to the user – along with answers from FERRET's automatic Q/A system – as potential continuations of the Q/A dialogue. We used the remaining 6 similarity metrics described in Section 4 for comparison in the evaluation reported below.
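To illustrate the selection and cluster-based re-ranking described above, the following Python fragment is a minimal sketch, not FERRET's implementation: the similarity function stands in for Similarity Metric 5, and the QuabEntry record, the precomputed centroid_rank mapping, and the additive 0.1 * delta boost are illustrative assumptions, since the exact adjustment formula is not reproduced in this excerpt.

    # Minimal sketch of QUAB selection with a cluster-based rank adjustment.
    # Assumptions (not from the paper): `similarity` stands in for Similarity
    # Metric 5, clusters and their centroid ranks are precomputed, and the
    # boost is a simple additive term proportional to the rank difference.
    from dataclasses import dataclass

    @dataclass
    class QuabEntry:                 # hypothetical record for one QUAB pair
        question: str
        answer: str
        cluster_id: int

    def select_quabs(user_q, quab_db, similarity, centroid_rank, top_n=10):
        """Return the top_n QUAB pairs judged most similar to user_q."""
        # Rank every QUAB question by similarity to the user question.
        scored = sorted(((similarity(user_q, e.question), e) for e in quab_db),
                        key=lambda pair: pair[0], reverse=True)
        adjusted = []
        for rank, (score, entry) in enumerate(scored):
            # Difference between this entry's rank and its cluster centroid's rank.
            delta = rank - centroid_rank.get(entry.cluster_id, rank)
            if delta > 0:                    # centroid outranks this entry
                score += 0.1 * delta         # hypothetical cluster adjustment
            adjusted.append((score, entry))
        adjusted.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in adjusted[:top_n]]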

To evaluate interactive Q/A dialogues, we have conducted three sets of experiments with human users of FERRET. In these experiments, users were allotted two hours to interact with FERRET to gather information requested by a dialogue scenario similar to the one presented in Figure 2. In Experiment 1 (E1), 8 U.S. Navy Reserve (USNR) intelligence analysts used FERRET to research 8 different scenarios related to chemical and biological weapons. Experiment 2 and Experiment 3 considered several of the same scenarios addressed in E1: E2 included 24 mixed teams of analysts and novice users working with 2 scenarios, while E3 featured 4 USNR analysts working with 6 of the original 8 scenarios. (Details for each experiment are provided in Table 2.) Users were also given a task to focus their research; in E1 and E3, users prepared a short report detailing their findings; in E2, users were given a list of “challenge” questions to answer.

Exp   Users   QUABs?   Scenarios   Topics
E1    8       Yes      8           Egypt BW, Russia CW, South Africa CW, India CW, North Korea CBW, Pakistan CW, Libya CW, Iran CW
E2    24      Yes      2           Egypt BW, Russia CW
E3    4       No       6           Egypt BW, Russia CW, North Korea CBW, Pakistan CW, India CW, Libya CW, Iran CW

Table 2: Experiment details

Country     n    QUAB (avg.)   User Q (avg.)   Total (avg.)
India       2    21.5          13.0            34.5
Libya       2    12.0          9.0             21.0
Iran        2    18.5          11.0            29.5
N. Korea    2    16.5          7.5             24.0
Pakistan    2    29.5          15.5            45.0
S. Africa   2    14.5          6.0             20.5
Russia      2    13.5          15.5            29.0
Egypt       2    15.0          20.5            35.5
TOTAL(E1)   16   17.63         12.25           29.88

Table 4: Efficiency of Dialogues in Experiment 1

Country     n    QUAB (avg.)   User Q (avg.)   Total (avg.)
Russia      24   8.2           5.5             13.7
Egypt       24   10.8          7.6             18.4
TOTAL(E2)   48   9.50          6.55            16.05

Table 5: Efficiency of Dialogues in Experiment 2

In E1 and E2, users had access to a total of 3210 QUAB questions that had been hand-created by developers for each of the 8 dialogue scenarios. (Table 3 provides totals for each scenario.) In E3, users performed research with a version of FERRET that included no QUABs at all.

Scenario        Handcrafted QUABs
India           460
Libya           414
Iran            522
North Korea     316
Pakistan        322
South Africa    454
Russia          366
Egypt           356
Testing Total   3210

Table 3: QUAB distribution over scenarios

We have evaluated FERRET by measuring efficiency, effectiveness, and user satisfaction:

Efficiency: FERRET's QUAB collection enabled users in our experiments to find more relevant information by asking fewer questions. When manually-created QUABs were available (E1 and E2), users submitted an average of 12.25 questions each session. When no QUABs were available (E3), users entered an average of 44.5 questions per session. Table 4 lists the number of QUAB question-answer pairs selected by users and the number of user questions entered by users during the 8 scenarios considered in E1. In E2, freed from the task of writing a research report, users asked significantly (p < 0.05) fewer questions and selected fewer QUABs than they did in E1. (See Table 5.)
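The per-scenario figures reported in Tables 4 and 5 are simple per-session averages. The sketch below shows one way such a table could be computed; the session tuples and the efficiency_table helper are hypothetical placeholders, not the experiments' actual logging format.

    # Hedged sketch: per-scenario averages of user questions entered and QUAB
    # pairs selected, as in Tables 4 and 5. The session records below are
    # illustrative placeholders, not actual experiment logs.
    from collections import defaultdict

    # (scenario, user questions entered, QUAB pairs selected) per session
    sessions = [("India", 12, 20), ("India", 14, 23),
                ("Libya", 9, 13), ("Libya", 9, 11)]

    def efficiency_table(sessions):
        by_scenario = defaultdict(list)
        for scenario, user_q, quabs in sessions:
            by_scenario[scenario].append((user_q, quabs))
        rows = {}
        for scenario, counts in by_scenario.items():
            n = len(counts)
            avg_user_q = sum(u for u, _ in counts) / n
            avg_quabs = sum(q for _, q in counts) / n
            rows[scenario] = (n, avg_quabs, avg_user_q, avg_quabs + avg_user_q)
        return rows   # scenario -> (n, QUAB avg., User Q avg., Total avg.)

    print(efficiency_table(sessions))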

Effectiveness: QUAB question-answer pairs also improved the overall accuracy of the answers returned by FERRET. To measure the effectiveness of a Q/A dialogue, human annotators were used to perform a post-hoc analysis of how relevant the QUAB pairs returned by FERRET were to each question entered by a user: each QUAB pair returned was graded as “relevant” or “irrelevant” to a user question in a forced-choice task. Aggregate relevance scores were used to calculate (1) the percentage of relevant QUAB pairs returned and (2) the mean reciprocal rank (MRR) for each user question. MRR is defined as (1/N) Σ_i 1/r_i, where r_i is the lowest rank of any relevant answer for the i-th user query.[7] Table 6 describes the performance of FERRET when each of the 7 similarity measures presented in Section 4 is used to return QUAB pairs in response to a query. When only answers from FERRET's automatic Q/A system were available to users, only 15.7% of system responses were deemed to be relevant to a user's query. In contrast, when manually-generated QUAB pairs were introduced, as high as 84% of the system's responses were deemed to be relevant. The results listed in Table 6 show that the best metric is Similarity Metric 5. These results suggest that the selection of relevant questions depends on sophisticated similarity measures that rely on conceptual hierarchies and semantic recognizers.

[7] We chose MRR as our scoring metric because it reflects the fact that a user is most likely to examine the first few answers from any system, but that all correct answers returned by the system have some value because users will sometimes examine a very large list of query results.
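To make the two reported quantities concrete, the sketch below aggregates forced-choice relevance judgments into a percentage of relevant responses and an MRR score. The list-of-booleans input format and the evaluate helper are assumptions for illustration, not the evaluation scripts used in these experiments.

    # Hedged sketch: aggregating forced-choice relevance judgments into the
    # kinds of quantities reported in Table 6. Input format is an assumption:
    # one list of booleans per user question, giving the judgment for each
    # returned response in rank order (True = judged relevant).
    def evaluate(judgments, top_k=5):
        n_questions = len(judgments)
        returned = sum(len(j[:top_k]) for j in judgments)
        relevant = sum(sum(j[:top_k]) for j in judgments)
        pct_top_k = 100.0 * relevant / returned        # % of top-k responses relevant
        pct_top_1 = 100.0 * sum(1 for j in judgments if j and j[0]) / n_questions
        # Reciprocal rank: 1 / best rank of a relevant response, 0 if none.
        rr = [next((1.0 / (i + 1) for i, rel in enumerate(j) if rel), 0.0)
              for j in judgments]
        mrr = sum(rr) / n_questions
        return pct_top_k, pct_top_1, mrr

    # Toy example with two user questions, five judged responses each.
    print(evaluate([[False, True, True, False, False],
                    [True, False, False, False, True]]))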

                % of Top 5 Responses    % of Top 1 Responses    MRR
                Relevant to User Q      Relevant to User Q
Without QUAB    15.73%                  26.85%                  0.325
Similarity 1    82.61%                  60.63%                  0.703
Similarity 2    79.95%                  58.45%                  0.681
Similarity 3    79.47%                  56.04%                  0.664
Similarity 4    78.26%                  46.14%                  0.592
Similarity 5    84.06%                  68.36%                  0.753
Similarity 6    81.64%                  56.04%                  0.671
Similarity 7    84.54%                  64.01%                  0.730

Table 6: Effectiveness of dialogues

We evaluated the quality of each of the four sets of automatically-generated QUABs in a similar fashion. For each question submitted by a user in E1, E2, and E3, we collected the top 5 QUAB question-answer pairs (as determined by Similarity Metric 5) that FERRET returned. As with the manually-generated QUABs, the automatically-generated pairs were submitted to human assessors who annotated each as “relevant” or “irrelevant” to the user's query. Aggregate scores are presented in Table 7.

              Egypt                               Russia
Approach      % of Top 5 Responses    MRR         % of Top 5 Responses    MRR
              Rel. to User Q                      Rel. to User Q
Approach 1    40.01%                  0.295       60.25%                  0.310
Approach 2    36.00%                  0.243       72.00%                  0.475
Approach 3    44.62%                  0.271       60.00%                  0.297
Approach 4    68.05%                  0.510       68.00%                  0.406

Table 7: Quality of QUABs acquired automatically

User Satisfaction: Users were consistently satisfied with their interactions with FERRET. In all three experiments, respondents claimed that they found that FERRET (1) gave meaningful answers, (2) provided useful suggestions, (3) helped answer specific questions, and (4) promoted their general understanding of the issues considered in the scenario. Complete results of this study are presented in Table 8.[8]

Factor                             E1     E2     E3
Promoted understanding             3.40   3.20   3.75
Helped with specific questions     3.70   3.60   3.25
Made good use of questions         3.40   3.55   3.00
Gave new scenario insights         3.00   3.10   2.20
Gave good collection coverage      3.75   3.70   3.75
Stimulated user thinking           3.50   3.20   2.75
Easy to use                        3.50   3.55   4.10
Expanded understanding             3.40   3.20   3.00
Gave meaningful answers            4.10   3.60   2.75
Was helpful                        4.00   3.75   3.25
Helped with new search methods     2.75   3.05   2.25
Provided novel suggestions         3.25   3.40   2.65
Is ready for work environment      2.85   2.80   3.25
Would speed up work                3.25   3.25   3.00
Overall like of system             3.75   3.60   3.75

Table 8: User Satisfaction Survey Results

[8] Evaluation scale: 1 = does not describe the system, 5 = completely describes the system.

6 Conclusions

We believe that the quality of Q/A interactions depends on the modeling of scenario topics. An ideal model is provided by question-answer databases (QUABs) that are created off-line and then used to make suggestions to a user of potentially relevant continuations of a discourse. In this paper, we have presented FERRET, an interactive Q/A system which makes use of a novel Q/A architecture that integrates QUAB question-answer pairs into the processing of questions. Experiments with FERRET have shown that QUAB pairs are rapidly adopted by users as valid suggestions and that their incorporation into Q/A can greatly improve the overall accuracy of an interactive Q/A dialogue.