
ambiguate the most ambiguous phrase in his or her question.

3 Evaluation Experiments

Questions consisting of 69 sentences read aloud by seven male speakers were transcribed by our ASR

3.3 Evaluation method

Table 2: Evaluation results of disambiguating queries generated by the DDQ module.

SPK  Word acc.  MRR(REC)  MRR(DEL)  MRR(SCRN)  MRR(DQ)  w/o errors  APP  INAPP
A    70%       0.19      0.16      0.17       0.23     4           32   33
B    76%       0.31      0.24      0.29       0.31     8           36   25
C    79%       0.26      0.18      0.26       0.30     10          34   25
D    73%       0.27      0.21      0.24       0.30     4           35   30
E    78%       0.24      0.21      0.24       0.27     7           31   31
F    80%       0.28      0.25      0.30       0.33     8           34   27
G    74%       0.22      0.19      0.19       0.22     3           35   31
AVG  76%       0.25      0.21      0.24       0.28     9%          49%  42%

Integers without a % sign, other than the MRR values, indicate numbers of sentences. Word acc.: word accuracy; SPK: speaker; AVG: averaged values; w/o errors: transcribed sentences without recognition errors; APP: appropriate DQs; INAPP: inappropriate DQs.

The DQs generated by the DDQ module were evaluated in comparison with manually created disambiguation queries. Although the questions read by the seven speakers had sufficient information to extract exact answers, some recognition errors resulted in a loss of information that was indispensable for obtaining the correct answers. The manual DQs were made by five subjects based on a comparison of the original written questions and the transcription results given by the ASR system. The automatic DQs were categorized into two classes: APPROPRIATE (APP) when they had the same meaning as at least one of the five manual DQs, and INAPPROPRIATE (INAPP) when there was no match. The QA performance using the recognized (REC) and screened (SCRN) questions was evaluated by MRR (Mean Reciprocal Rank). SCRN was compared with the transcribed questions that just had the recognition errors removed (DEL). In

addition, the questions reconstructed manually by merging these questions with the additional information requested by the DQs generated using SCRN (DQ) were also evaluated. This additional information was extracted from the original users' questions, which contained no recognition errors. In this study, adding information by using the DQs was performed only once.

clude an evaluation of the appropriateness of the DQs derived repeatedly to obtain the final answers. In addition, the interaction strategy automatically generated by the DDQ module should be evaluated in terms of how much the DQs improve the QA system's total performance.
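For reference, the MRR (Mean Reciprocal Rank) measure used to score the REC, DEL, SCRN, and DQ conditions can be sketched as below. This is an illustrative implementation, not the paper's own code; the data layout assumed here (a ranked list of candidate answers plus a set of correct answers per question) and all function names are hypothetical.

```python
def reciprocal_rank(ranked_answers, correct_answers):
    """Return 1/rank of the first correct answer, or 0.0 if none appears."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer in correct_answers:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal ranks over all questions.

    `runs` is a list of (ranked_answers, correct_answers) pairs,
    one pair per question.
    """
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, c) for r, c in runs) / len(runs)


# Toy example: three questions whose correct answer is found at
# rank 1, at rank 4, and not at all, respectively.
runs = [
    (["Tokyo", "Osaka"], {"Tokyo"}),
    (["a", "b", "c", "d"], {"d"}),
    (["x", "y"], {"z"}),
]
print(mean_reciprocal_rank(runs))  # (1.0 + 0.25 + 0.0) / 3 ≈ 0.417
```

A question with no correct answer in its ranked list contributes 0 to the average, which is why conditions that lose indispensable information to recognition errors depress the MRR scores in Table 2.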