1-5 (1 – Strongly Disagree, 2 – Disagree, 3 – Somewhat Agree, 4 – Agree, 5 – Strongly Agree). If the NM indeed has any effect, we should observe differences between the ratings of the NM problem and the noNM problem (i.e., the problem in which the NM is disabled).

Table 1 lists the 16 questions in questionnaire order. For every question, the table shows the average rating for all condition-problem combinations (e.g., column 5: condition F, problem 1, with the NM enabled). For all questions except Q7 and Q11, a higher rating is better. For Q7 and Q11 (italicized in Table 1), a lower rating is better, as they gauge negative factors (high level of concentration and task disorientation). They also served as a deterrent for negligence while rating.

To test whether the NM presence has a significant effect, a repeated-measures ANOVA with between-subjects factors was applied. The within-subjects factor was the NM presence (NMPres) and the between-subjects factor was the condition (Cond)¹. The significance of the effect of each factor and of their combination (NMPres*Cond) is listed in the table, with significant and trend effects highlighted in bold (see columns 2-4). Post-hoc t-tests between the NM and noNM ratings were run for each condition (“s”/“t” marks significant/trend differences).

Results for Q1-6

Questions Q1-6 were inspired by previous work on spoken dialogue system evaluation (e.g., Walker et al., 2000) and measure users' overall perception of the system. We find that the NM presence significantly improves users' perception of the system in terms of their ability to concentrate on the instruction (Q3), their inclination to reuse the system (Q6), and the system's matching of their expectations (Q4). There is a trend that it was easier for them to learn from the NM-enabled version of the system (Q2).

Results for Q7-13

Q7-13 relate directly to our hypothesis that users

Structure. Users perceive the system as having a structured tutoring plan significantly² more in the NM problems (Q8). Moreover, it is significantly easier for them to follow this tutoring plan if the NM is present (Q11). These effects are very clear for F users, whose ratings differ significantly between the first (NM) and the second problem (noNM). A difference in ratings is present for S users, but it is not significant. As with most of the S users' ratings, we believe that the NM presentation order is responsible for the mostly non-significant differences. More specifically, assuming that the NM has a positive effect, the S users are asked to rate first the poorer version of the system (noNM) and then the better version (NM). In contrast, F users' task is easier, as they already have a high reference point (NM) and it is easier for them to criticize the second problem (noNM). Other factors that can blur the effect of the NM are domain learning and users' adaptation to the system.

Integration. Q9 and Q10 look at how well users think they integrate the system questions in both a forward-looking fashion (Q9) and a backward-looking fashion (Q10). Users think that it is significantly easier for them to integrate the current system question with what will be discussed in the future if the NM is present (Q9). Also, if the NM is present, it is easier for users to integrate the current question with the discussion so far (Q10, trend). For Q10, there is no difference for F users but a significant one for S users. We hypothesize that domain learning is involved here: F users learn better from the first problem (NM) and thus have fewer issues solving the second problem (noNM). In contrast, S users have more difficulties in the first problem (noNM), but the presence of the NM eases their task in the second problem.

Correctness. The correct answer NM feature is useful for users too. There is a trend that it is easier for users to know the correct answer if the NM is present (Q13). We hypothesize that speech recognition and language understanding errors are re-

¹ Since in this version of ANOVA the NM/noNM ratings come from two different problems depending on the condition, we also ran an ANOVA in which the within-subjects factor was the problem (Prob). In this case, the NM effect corresponds to an effect from Prob*Cond, which is identical in significance to that of NMPres.

² We refer to the significance of the NMPres factor (Table 1, column 2). When discussing individual experimental conditions, we refer to the post-hoc t-tests.
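The per-condition post-hoc comparison of NM and noNM ratings can be illustrated with a minimal sketch. It assumes the t-tests are paired (each user rates both problems), which the text does not state explicitly. The ratings below are invented for illustration and are not taken from Table 1, and the names `ratings`, `NM_PROBLEM`, `split_by_nm` and `paired_t` are hypothetical. The sketch computes only the t statistic and degrees of freedom; a p-value would additionally need the t distribution (for example via `scipy.stats.ttest_rel`), and the repeated-measures ANOVA itself is not reproduced here.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical 1-5 ratings for a single question, one entry per user.
# These values are made up for illustration; they are NOT from Table 1.
ratings = {
    "F": {"P1": [5, 4, 4, 5, 3, 4], "P2": [3, 4, 2, 4, 3, 2]},
    "S": {"P1": [3, 4, 3, 2, 4, 3], "P2": [4, 4, 3, 3, 5, 3]},
}

# Per the text: F users see the NM in problem 1, S users in problem 2.
NM_PROBLEM = {"F": "P1", "S": "P2"}


def split_by_nm(cond):
    """Return (NM ratings, noNM ratings) for one condition."""
    nm_prob = NM_PROBLEM[cond]
    nonm_prob = "P2" if nm_prob == "P1" else "P1"
    return ratings[cond][nm_prob], ratings[cond][nonm_prob]


def paired_t(nm, nonm):
    """Paired t statistic and degrees of freedom for NM minus noNM.

    Assumes the per-user differences are not all identical
    (otherwise the standard deviation is zero and the ratio is undefined).
    """
    diffs = [a - b for a, b in zip(nm, nonm)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


for cond in ("F", "S"):
    t, df = paired_t(*split_by_nm(cond))
    print(f"condition {cond}: t({df}) = {t:.3f}")
```

Keeping the condition-to-problem mapping in one place (`NM_PROBLEM`) makes explicit that the NM and noNM ratings come from different problems in the two conditions, which is the concern raised in footnote 1.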

Table 1 (column headers): Question; ANOVA: NMPres, Cond, NMPres*Cond; Average rating: F condition (P1, P2), S condition (P1, P2), Overall (NM, noNM)