5.1 Summarization Results

To evaluate the performance of our CRF-based answer summarization model, we conduct experiments on the Yahoo! Answers archives dataset. The Yahoo! Webscope™ Program[4] opens up a number of Yahoo! Answers datasets for interested academics in different categories. Our original dataset contains 1,300,559 questions and 2,770,896 answers in ten taxonomies from Yahoo! Answers. After filtering out the questions that have fewer than 5 answers, as well as some trivial factoid questions, using the features of (Tomasoni and Huang, 2010), we reduce the dataset to 55,132 questions. From this subset, we next select the questions with incomplete answers as defined in Section 2.1. Specifically, we select the questions where the average similarity between the best answer and all sub-questions is less than 0.6, or where the star rating of the best answer is less than 4. We obtain 7,784 questions after this step. To evaluate the effectiveness of this selection method, we randomly choose 400 questions from the filtered dataset and invite 10 graduate students (not in the NLP research field) to verify whether each question suffers from the incomplete answer problem. We divide the students into five groups of two each, and consider a question to be an "incomplete answer question" only when both members of a group judge it to be the case. As a result, we find that 360 (90%) of these questions indeed suffer from the incomplete answer problem, which indicates that our automatic detection method is effective. These randomly selected 400 questions, along with their 2,559 answers, are then manually summarized for the evaluation of the answer summaries automatically generated by our model in the experiments.

[4] https://traloihay.net
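The selection rule above is a simple disjunction over two thresholds. The following is a minimal sketch of it; the token-overlap similarity, the field names on `question`, and the helper names are illustrative stand-ins, since the actual similarity measure is the one defined in Section 2.1.

```python
# Hypothetical sketch of the incomplete-answer selection rule described above.
# Field names (best_answer, sub_questions, star_rating) are illustrative.

SIM_THRESHOLD = 0.6   # cutoff on average best-answer/sub-question similarity
STAR_THRESHOLD = 4    # minimum star rating of the best answer

def avg_similarity(best_answer, sub_questions):
    # Toy token-overlap similarity, standing in for the measure of Section 2.1.
    ans = set(best_answer.lower().split())
    def overlap(q):
        qs = set(q.lower().split())
        return len(ans & qs) / max(len(ans | qs), 1)
    return sum(map(overlap, sub_questions)) / max(len(sub_questions), 1)

def has_incomplete_answer(question):
    sim = avg_similarity(question.best_answer, question.sub_questions)
    return sim < SIM_THRESHOLD or question.star_rating < STAR_THRESHOLD

def select_incomplete(questions):
    # Keep only the questions flagged as having incomplete answers.
    return [q for q in questions if has_incomplete_answer(q)]
```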

We adopt the Support Vector Machine (SVM) and Logistic Regression (LR), which have been reported to be effective for classification, and the Linear CRF (LCRF), which is used to summarize ordinary text documents in (Shen et al., 2007), as baselines for comparison. To better illustrate the effectiveness of the question segmentation based contextual factors and of the group L1 regularization term, we carry out the tests in the following sequence: (a) we use only the contextual factors cf_3 and cf_4 with the default L2 regularization (gCRF); (b) we add the reply question based factors cf_1 and cf_2 to the model (gCRF-QS); and (c) we replace the default L2 regularization with our proposed group L1 regularization term (gCRF-QS-l1).
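For reference, the group L1 term in step (c) corresponds to the standard "group lasso" penalty. The form below assumes the weight vector w is partitioned into G groups (e.g., one per feature group); this follows the usual definition and may differ from the paper's exact weighting:

\[
\Omega_{L_2}(\mathbf{w}) = \lambda\,\lVert \mathbf{w} \rVert_2^2,
\qquad
\Omega_{gL_1}(\mathbf{w}) = \lambda \sum_{g=1}^{G} \lVert \mathbf{w}_g \rVert_2 .
\]

Because the penalty is an L1-style sum over per-group L2 norms, it can drive entire groups of weights to zero at once, yielding sparsity at the level of feature groups rather than individual features.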

For the linear CRF system, we use all of our textual and non-textual features, as well as the local (exact previous and next) neighborhood contextual factors, instead of the features of (Shen et al., 2007), for fairness. For the thresholds used in the contextual factors, we set τ_lq equal to τ_ls and τ_uq equal to τ_us in order to simplify the parameter setting (τ_lq = τ_ls = 0.4 and τ_uq = τ_us = 0.8 in our experiments). We randomly divide the dataset into ten subsets (each subset containing 40 questions and the associated answers) and conduct a ten-fold cross validation, where in each round nine subsets are used to train the model and the remaining one is used for testing. The precision, recall and F1 measures of these models are presented in Table 2.
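As a concrete illustration of this split, here is a minimal sketch using scikit-learn's KFold; the toolkit, the shuffling, and the commented-out train/evaluate calls are assumptions, since the paper does not describe its implementation.

```python
from sklearn.model_selection import KFold

questions = list(range(400))  # stand-ins for the 400 annotated questions

# Ten folds of 40 questions each; in each round, nine folds train and one tests.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(questions)):
    assert len(test_idx) == 40
    # train_model([questions[i] for i in train_idx])   # hypothetical training call
    # evaluate([questions[i] for i in test_idx])       # hypothetical evaluation call
```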

Model        Precision  Recall    F1
SVM          65.93%     61.96%    63.88%
LR           66.92%-    61.31%-   63.99%-
LCRF         69.80%     63.91%-   66.73%
gCRF         73.77%↑    69.43%↑   71.53%↑
gCRF-QS      74.78%     72.51%    73.63%
gCRF-QS-l1   79.92%     71.73%-   75.60%

Table 2: The Precision, Recall and F1 measures of the baselines (SVM, LR, LCRF) and our general CRF-based models (gCRF, gCRF-QS, gCRF-QS-l1). The up-arrow denotes a performance improvement over the previous method (the row above) with statistical significance under a p value of 0.05; the short line '-' denotes that there is no statistically significant improvement.

Table 2 shows that our general CRF model based on question segmentation with group L1 regularization significantly outperforms the baselines in all three measures (gCRF-QS-l1 is 13.99% better than SVM in precision, 9.77% better in recall, and 11.72% better in F1 score, all in absolute percentage points). We note that both SVM and LR, [...] precision while not sacrificing the recall measurement statistically.
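The significance marks in Table 2 are reported at p < 0.05, but the test itself is not named in this section; a common choice when ten folds are shared across models is a paired t-test over fold-level scores, sketched below with made-up numbers.

```python
from scipy import stats

# Hypothetical fold-level F1 scores for two models across the same ten folds.
f1_gcrf    = [0.71, 0.72, 0.70, 0.73, 0.71, 0.72, 0.70, 0.73, 0.72, 0.71]
f1_gcrf_qs = [0.73, 0.74, 0.73, 0.75, 0.73, 0.74, 0.72, 0.75, 0.74, 0.73]

# Paired t-test: appropriate because the same folds are used for both models.
t_stat, p_value = stats.ttest_rel(f1_gcrf_qs, f1_gcrf)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```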

We also compute the Precision, Recall and F1 of the ROUGE-1, ROUGE-2 and ROUGE-L measurements, which are widely used to measure the quality of automatic text summarization. The experimental results are listed in Table 3. All results in the table are averages over the ten-fold cross validation experiments on our dataset.
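ROUGE precision, recall and F1 can be computed per reference/system summary pair; the sketch below uses the `rouge-score` package as an assumed implementation (the paper does not say which ROUGE tool was used) with toy strings.

```python
from rouge_score import rouge_scorer

# Hypothetical reference (manual) and system-generated summaries.
reference = "replace the battery and reset the device"
generated = "reset the device and replace the battery"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)

for name, s in scores.items():
    # Each entry carries the precision, recall and F1 reported in Table 3.
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```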

Model        R1 P    R1 R    R1 F1   R2 P    R2 R    R2 F1   RL P    RL R    RL F1
SVM          79.2%   52.5%   63.1%   71.9%   41.3%   52.4%   67.1%   36.7%   47.4%
LR           75.2%   57.4%   65.1%   66.1%   48.5%   56.0%   61.6%   43.2%   50.8%
LCRF         78.7%-  61.8%   69.3%-  71.4%-  54.1%   61.6%   67.1%-  49.6%   57.0%
gCRF         81.9%   65.2%   72.6%   76.8%   57.3%   65.7%   73.9%   53.5%   62.1%
gCRF-QS      81.4%-  70.0%   75.3%   76.2%-  62.4%   68.6%   73.3%-  58.6%   65.1%
gCRF-QS-l1   86.6%   68.3%-  76.4%   82.6%   61.5%-  70.5%   80.4%   58.2%-  67.5%

Table 3: The Precision, Recall and F1 of ROUGE-1, ROUGE-2 and ROUGE-L for the baselines (SVM, LR, LCRF) and our general CRF-based models (gCRF, gCRF-QS, gCRF-QS-l1). The down-arrow means performance degradation with statistical significance.

It is observed that our gCRF-QS-l1 model improves the performance in terms of precision, recall and F1 score on all three measurements of ROUGE-1, ROUGE-2 and ROUGE-L.