1-5 (1 – Strongly Disagree, 2 – Disagree, 3 – Somewhat Agree, 4 – Agree, 5 – Strongly Agree). If indeed the NM has any effect, we should observe differences between the ratings of the NM problem and the noNM problem (i.e. the NM is disabled).

Table 1 lists the 16 questions in the questionnaire order. The table shows for every question the average rating for all condition-problem combinations (e.g. column 5: condition F problem 1 with the NM enabled). For all questions except Q7 and Q11 a higher rating is better. For Q7 and Q11 (italicized in Table 1) a lower rating is better as they gauge negative factors (high level of concentration and task disorientation). They also served as a deterrent for negligence while rating.

To test if the NM presence has a significant effect, a repeated-measures ANOVA with between-subjects factors was applied. The within-subjects factor was the NM presence (NMPres) and the between-subjects factor was the condition (Cond)¹. The significance of the effect of each factor and their combination (NMPres*Cond) is listed in the table, with significant and trend effects highlighted in bold (see columns 2-4). Post-hoc t-tests between the NM and noNM ratings were run for each condition (“s”/“t” marks significant/trend differences).

Results for Q1-6

Questions Q1-6 were inspired by previous work on spoken dialogue system evaluation (e.g. Walker et al., 2000) and measure users' overall perception of the system. We find that the NM presence significantly improves users' perception of the system in terms of their ability to concentrate on the instruction (Q3), their inclination to reuse the system (Q6), and the system's matching of their expectations (Q4). There is a trend that it was easier for them to learn from the NM-enabled version of the system (Q2).

Results for Q7-13

Q7-13 relate directly to our hypothesis that users

Structure. Users perceive the system as having a structured tutoring plan significantly² more in the NM problems (Q8). Moreover, it is significantly easier for them to follow this tutoring plan if the NM is present (Q11). These effects are very clear for F users, whose ratings differ significantly between the first (NM) and the second problem (noNM). A difference in ratings is present for S users but it is not significant. As with most of the S users' ratings, we believe that the NM presentation order is responsible for the mostly non-significant differences. More specifically, assuming that the NM has a positive effect, the S users are asked to rate first the poorer version of the system (noNM) and then the better version (NM). In contrast, F users' task is easier as they already have a high reference point (NM) and it is easier for them to criticize the second problem (noNM). Other factors that can blur the effect of the NM are domain learning and users' adaptation to the system.

Integration. Q9 and Q10 look at how well users think they integrate the system questions in both a forward-looking fashion (Q9) and a backward-looking fashion (Q10). Users think that it is significantly easier for them to integrate the current system question with what will be discussed in the future if the NM is present (Q9). Also, if the NM is present, it is easier for users to integrate the current question with the discussion so far (Q10, trend). For Q10, there is no difference for F users but a significant one for S users. We hypothesize that domain learning is involved here: F users learn better from the first problem (NM) and thus have fewer issues solving the second problem (noNM). In contrast, S users have more difficulties in the first problem (noNM), but the presence of the NM eases their task in the second problem.

Correctness. The correct answer NM feature is useful for users too. There is a trend that it is easier for users to know the correct answer if the NM is
present (Q13). We hypothesize that speech recognition and language understanding errors are re-

¹ Since in this version of ANOVA the NM/noNM ratings come from two different problems based on the condition, we also run an ANOVA in which the within-
2