1, ROUGE-2 and ROUGE-L by a significant margin compared to other baselines, due to the use of local and non-local contextual factors and factors based on QS with group $L_1$ regularization. Since the ROUGE measures care more about the recall and precision of N-grams and common substrings with respect to the reference summary, rather than the whole sentence, they offer a better measurement in modeling the user’s information needs. Therefore, the improvements in these measures are more encouraging than those in the average classification accuracy for answer summarization.

From the viewpoint of the ROUGE measures, we observe that our question segmentation method significantly enhances the recall of the summaries, owing to the more fine-grained modeling of sub questions. We also find that the precision of the group $L_1$ regularization is much better than that of the default $L_2$ regularization, while not hurting the recall significantly. In general, the experimental results show that our proposed method is more effective than other

difference in statistical significance.

which just utilize the independent sentence-level features, do not perform very well here, and there is no statistically significant performance difference between them. We also find that LCRF, which utilizes the local context information between sentences, performs better than the LR method in precision and F1 with statistical significance. When we consider the general local and non-local contextual factors $cf_3$ and $cf_4$ for novelty and non-redundancy constraints, the gCRF performs much better than LCRF in all three measures, and we obtain further performance improvement by adding the contextual factors based on QS, especially in the recall measurement. This is mainly because we have divided the question into several sub questions, and the system is able to select more novel sentences than when treating the original multi-sentence question as a whole. In addition, when we replace the default $L_2$ regularization by the group $L_1$