5.1 Summarization Results
To evaluate the performance of our CRF-based answer summarization model, we conduct experiments on the Yahoo! Answers archives dataset. The Yahoo! Webscope™ Program⁴ opens up a number of Yahoo! Answers datasets in different categories to interested academics. Our original dataset contains 1,300,559 questions and 2,770,896 answers in ten taxonomies from Yahoo! Answers. After filtering out the questions that have fewer than 5 answers, as well as some trivial factoid questions, using the features of (Tomasoni and Huang, 2010), we reduce the dataset to 55,132 questions. From this subset, we next select the questions with incomplete answers as defined in Section 2.1. Specifically, we select the questions where the average similarity between the best answer and all sub-questions is less than 0.6, or where the star rating of the best answer is less than 4. We obtain 7,784 questions after this step. To evaluate the effectiveness of this method, we randomly choose 400 questions from the filtered dataset and invite 10 graduate students (not in the NLP research field) to verify whether each question suffers from the incomplete answer problem. We divide the students into five groups of two each, and we consider a question to be an "incomplete answer question" only when both members of a group judge it to be the case. As a result, we find that 360 (90%) of these questions indeed suffer from the incomplete answer problem, which indicates that our automatic detection method is effective. These randomly selected 400 questions, along with their 2,559 answers, are then manually summarized for the evaluation of the answer summaries automatically generated by our model in the experiments.

We adapt the Support Vector Machine (SVM) and Logistic Regression (LR), which have been reported to be effective for classification, and the Linear CRF (LCRF), which is used to summarize ordinary text documents in (Shen et al., 2007), as baselines for comparison. To better illustrate the effectiveness of the question segmentation based contextual factors and the group L1 regularization term, we carry out the tests in the following sequence: (a) we use only the contextual factors cf_3 and cf_4 with default L2 regularization (gCRF); (b) we add the reply question based factors cf_1 and cf_2 to the model (gCRF-QS); and (c) we replace the default L2 regularization with our proposed group L1 regularization term (gCRF-QS-l1). For the linear CRF system, we use all our textual and non-textual features as well as the local (exact previous and next) neighborhood contextual factors instead of the features of (Shen et al., 2007), for fairness. For the thresholds used in the contextual factors, we enforce τ_lq to be equal to τ_ls, and τ_uq to be equal to τ_us, in order to simplify the parameter settings (τ_lq = τ_ls = 0.4 and τ_uq = τ_us = 0.8 in our experiments). We randomly divide the dataset into ten subsets (each subset with 40 questions and the associated answers) and conduct a ten-fold cross validation, where in each round nine subsets are used to train the model and the remaining one is used for testing. The precision, recall and F1 measures of these models are presented in Table 2.

Table 2 shows that our general CRF model based on question segmentation with group L1 regularization outperforms the baselines significantly in all three measures (gCRF-QS-l1 is 13.99% better than SVM in precision, 9.77% better in recall and 11.72%
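The incomplete-answer selection rule described above (average best-answer/sub-question similarity below 0.6, or best-answer star rating below 4) can be sketched as a simple filter. This is a minimal sketch, not the paper's implementation: the similarity function `sim` is a placeholder assumption standing in for whatever similarity measure the model uses, and the thresholds are exposed as parameters with the paper's values as defaults.

```python
def is_incomplete(best_answer, sub_questions, star_rating,
                  sim=lambda a, q: 0.0,
                  sim_threshold=0.6, star_threshold=4):
    """Flag a question as suffering from the incomplete answer problem.

    A question is selected when the average similarity between its best
    answer and all sub-questions falls below sim_threshold, OR when the
    best answer's star rating is below star_threshold.  `sim` is a
    placeholder for the actual similarity measure.
    """
    avg_sim = sum(sim(best_answer, q) for q in sub_questions) / len(sub_questions)
    return avg_sim < sim_threshold or star_rating < star_threshold
```

Either condition alone is sufficient, which matches the disjunctive ("less than 0.6 or ... less than 4") wording of the selection step.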
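The evaluation protocol above (ten random subsets of 40 questions each, with nine used for training and one for testing in each round) can be sketched as follows. The shuffling and fold iteration here are standard cross-validation machinery under an assumed fixed seed, not details taken from the paper.

```python
import random

def ten_fold_splits(questions, n_folds=10, seed=0):
    """Randomly partition `questions` into n_folds equal subsets and
    yield (train, test) pairs, one per cross-validation round."""
    items = list(questions)
    random.Random(seed).shuffle(items)
    fold_size = len(items) // n_folds
    folds = [items[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train, test
```

With the 400 selected questions, each round trains on 360 questions and tests on the remaining 40, as in the experimental setup.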
⁴ https://traloihay.net