
difference in statistical significance. The baselines which just utilize independent sentence-level features do not behave very well here, and there is no statistically significant performance difference between them. We also find that LCRF, which utilizes the local context information between sentences, performs better than the LR method in precision and F1 with statistical significance. When we consider the general local and non-local contextual factors cf3 and cf4 for the novelty and non-redundancy constraints, the gCRF performs much better than LCRF in all three measures; and we obtain a further performance improvement by adding the contextual factors based on QS, especially in the recall measure. This is mainly because we have divided the question into several sub-questions, so the system is able to select more novel sentences than when treating the original multi-sentence question as a whole. In addition, when we replace the default L2 regularization with group L1 regularization for more efficient feature weight learning, we obtain a much better performance in ROUGE-1, ROUGE-2 and ROUGE-L by a significant margin compared to the other baselines, due to the use of local and non-local contextual factors and of factors based on QS with group L1 regularization.

Since the ROUGE measures care more about the recall and precision of N-grams, as well as of common substrings with respect to the reference summary, rather than about whole sentences, they offer a better measure of how well the user's information needs are modeled. Therefore, the improvements in these measures are more encouraging than those in the average classification accuracy for answer summarization.

From the viewpoint of the ROUGE measures, we observe that our question segmentation method can significantly enhance the recall of the summaries, due to the more fine-grained modeling of sub-questions. We also find that the precision of group L1 regularization is much better than that of the default L2 regularization, while the recall is not hurt significantly. In general, the experimental results show that our proposed method is more effective than the other baselines in answer summarization for addressing the incomplete answer problem in cQAs.
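The ROUGE measures discussed above can be made concrete with a short sketch. This is our own minimal illustration, not the official ROUGE toolkit (which adds stemming, stopword options, and F-measure weighting): ROUGE-N compares clipped overlapping N-gram counts between a candidate summary and the reference, and ROUGE-L is based on the longest common subsequence.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """ROUGE-N precision and recall: clipped overlapping n-gram counts
    between candidate and reference token lists."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())          # & takes per-key minimum
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

def lcs_len(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """ROUGE-L precision and recall from the longest common subsequence."""
    l = lcs_len(candidate, reference)
    return l / max(len(candidate), 1), l / max(len(reference), 1)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(rouge_n(cand, ref, 1))   # unigram overlap: 5 of 6 tokens
print(rouge_l(cand, ref))      # LCS "the cat on the mat" has length 5
```

Because the overlap is computed over N-grams and subsequences rather than whole sentences, a summary sentence is rewarded for every recovered fragment of the reference, which is why these measures track the user's information needs more closely than sentence-level classification accuracy.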
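The contrast between the default L2 penalty and group L1 (group-lasso) regularization can also be sketched briefly. The grouping, function names, and the proximal shrinkage step below are a generic illustration under our own naming, not the paper's exact training procedure; the key property is that group L1 can zero out an entire group of feature weights at once, which is what makes feature weight learning sparser and more efficient.

```python
import math

def l2_penalty(w, lam):
    """Default L2 (ridge) penalty lam * ||w||^2: shrinks all weights
    smoothly but never sets a group exactly to zero."""
    return lam * sum(x * x for x in w)

def group_l1_penalty(groups, lam):
    """Group L1 penalty lam * sum_g ||w_g||_2 over feature groups:
    behaves like L1 across groups, so whole groups can vanish."""
    return lam * sum(math.sqrt(sum(x * x for x in g)) for g in groups)

def group_soft_threshold(g, t):
    """Proximal step for one group: shrink the group's L2 norm by t,
    zeroing the entire group when its norm is below t."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= t:
        return [0.0] * len(g)
    scale = 1.0 - t / norm
    return [scale * x for x in g]

# A strong group survives (shrunk); a weak group is removed entirely.
print(group_soft_threshold([3.0, 4.0], 1.0))
print(group_soft_threshold([0.3, 0.4], 1.0))
```

A group whose norm falls below the threshold is discarded as a whole, whereas L2 regularization would merely shrink its weights; this is the mechanism behind the precision gain reported above.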