5.1 Summarization Results

To evaluate the performance of our CRF-based answer summarization model, we conduct experiments on the Yahoo! Answers archives dataset. The Yahoo! Webscope™ Program[4] opens up a number of Yahoo! Answers datasets for interested academics in different categories. Our original dataset contains 1,300,559 questions and 2,770,896 answers in ten taxonomies from Yahoo! Answers. After filtering out the questions that have fewer than 5 answers, as well as some trivial factoid questions, using the features of (Tomasoni and Huang, 2010), we reduce the dataset to 55,132 questions. From this subset, we next select the questions with incomplete answers as defined in Section 2.1. Specifically, we select the questions where the average similarity between the best answer and all sub-questions is less than 0.6, or where the star rating of the best answer is less than 4. We obtain 7,784 questions after this step. To evaluate the effectiveness of this selection method, we randomly choose 400 questions from the filtered dataset and invite 10 graduate students (not in the NLP research field) to verify whether each question suffers from the incomplete answer problem. We divide the students into five groups of two each, and consider a question to be an "incomplete answer question" only when both members of a group judge it to be the case. As a result, we find that 360 (90%) of these questions indeed suffer from the incomplete answer problem, which indicates that our automatic detection method is effective. These randomly selected 400 questions, along with their 2,559 answers, are then manually summarized for the evaluation of the answer summaries automatically generated by our model in the experiments.

[4] https://traloihay.net
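The selection rule above is a simple disjunction over two thresholds. The following is a minimal sketch of it; the token-overlap similarity, the field names on `question`, and the helper names are illustrative stand-ins, since the actual similarity measure is the one defined in Section 2.1.

```python
# Hypothetical sketch of the incomplete-answer selection rule described above.
# Field names (best_answer, sub_questions, star_rating) are illustrative.

SIM_THRESHOLD = 0.6   # cutoff on average best-answer/sub-question similarity
STAR_THRESHOLD = 4    # minimum star rating of the best answer

def avg_similarity(best_answer, sub_questions):
    # Toy token-overlap similarity, standing in for the measure of Section 2.1.
    ans = set(best_answer.lower().split())
    def overlap(q):
        qs = set(q.lower().split())
        return len(ans & qs) / max(len(ans | qs), 1)
    return sum(map(overlap, sub_questions)) / max(len(sub_questions), 1)

def has_incomplete_answer(question):
    sim = avg_similarity(question.best_answer, question.sub_questions)
    return sim < SIM_THRESHOLD or question.star_rating < STAR_THRESHOLD

def select_incomplete(questions):
    # Keep only the questions flagged as having incomplete answers.
    return [q for q in questions if has_incomplete_answer(q)]
```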

We adopt the Support Vector Machine (SVM) and Logistic Regression (LR), which have been reported to be effective for classification, and the Linear CRF (LCRF), which is used to summarize ordinary text documents in (Shen et al., 2007), as baselines for comparison. To better illustrate the effectiveness of the question segmentation based contextual factors and of the group L1 regularization term, we carry out the tests in the following sequence: (a) we use only the contextual factors cf_3 and cf_4 with the default L2 regularization (gCRF); (b) we add the reply question based factors cf_1 and cf_2 to the model (gCRF-QS); and (c) we replace the default L2 regularization with our proposed group L1 regularization term (gCRF-QS-l1).
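For reference, the group L1 term in step (c) corresponds to the standard "group lasso" penalty. The form below assumes the weight vector w is partitioned into G groups (e.g., one per feature group); this follows the usual definition and may differ from the paper's exact weighting:

\[
\Omega_{L_2}(\mathbf{w}) = \lambda\,\lVert \mathbf{w} \rVert_2^2,
\qquad
\Omega_{gL_1}(\mathbf{w}) = \lambda \sum_{g=1}^{G} \lVert \mathbf{w}_g \rVert_2 .
\]

Because the penalty is an L1-style sum over per-group L2 norms, it can drive entire groups of weights to zero at once, yielding sparsity at the level of feature groups rather than individual features.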

For the linear CRF system, we use all of our textual and non-textual features, as well as the local (exact previous and next) neighborhood contextual factors, instead of the features of (Shen et al., 2007), for fairness. For the thresholds used in the contextual factors, we set τ_lq equal to τ_ls and τ_uq equal to τ_us in order to simplify the parameter setting (τ_lq = τ_ls = 0.4 and τ_uq = τ_us = 0.8 in our experiments). We randomly divide the dataset into ten subsets (each subset containing 40 questions and the associated answers) and conduct a ten-fold cross validation, where in each round nine subsets are used to train the model and the remaining one is used for testing. The precision, recall and F1 measures of these models are presented in Table 2.
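As a concrete illustration of this split, here is a minimal sketch using scikit-learn's KFold; the toolkit, the shuffling, and the commented-out train/evaluate calls are assumptions, since the paper does not describe its implementation.

```python
from sklearn.model_selection import KFold

questions = list(range(400))  # stand-ins for the 400 annotated questions

# Ten folds of 40 questions each; in each round, nine folds train and one tests.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(questions)):
    assert len(test_idx) == 40
    # train_model([questions[i] for i in train_idx])   # hypothetical training call
    # evaluate([questions[i] for i in test_idx])       # hypothetical evaluation call
```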

Model        Precision  Recall    F1
SVM          65.93%     61.96%    63.88%
LR           66.92%-    61.31%-   63.99%-
LCRF         69.80%     63.91%-   66.73%
gCRF         73.77%↑    69.43%↑   71.53%↑
gCRF-QS      74.78%     72.51%    73.63%
gCRF-QS-l1   79.92%     71.73%-   75.60%

Table 2: The Precision, Recall and F1 measures of the baselines (SVM, LR, LCRF) and our general CRF-based models (gCRF, gCRF-QS, gCRF-QS-l1). The up-arrow denotes a performance improvement over the previous method (the row above) with statistical significance under a p value of 0.05; the short line '-' denotes that there is no statistically significant improvement.

Table 2 shows that our general CRF model based on question segmentation with group L1 regularization significantly outperforms the baselines in all three measures (gCRF-QS-l1 is 13.99% better than SVM in precision, 9.77% better in recall, and 11.72% better in F1 score, all in absolute percentage points). We note that both SVM and LR, [...] precision while not sacrificing the recall measurement statistically.
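The significance marks in Table 2 are reported at p < 0.05, but the test itself is not named in this section; a common choice when ten folds are shared across models is a paired t-test over fold-level scores, sketched below with made-up numbers.

```python
from scipy import stats

# Hypothetical fold-level F1 scores for two models across the same ten folds.
f1_gcrf    = [0.71, 0.72, 0.70, 0.73, 0.71, 0.72, 0.70, 0.73, 0.72, 0.71]
f1_gcrf_qs = [0.73, 0.74, 0.73, 0.75, 0.73, 0.74, 0.72, 0.75, 0.74, 0.73]

# Paired t-test: appropriate because the same folds are used for both models.
t_stat, p_value = stats.ttest_rel(f1_gcrf_qs, f1_gcrf)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```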

We also compute the Precision, Recall and F1 of the ROUGE-1, ROUGE-2 and ROUGE-L measurements, which are widely used to measure the quality of automatic text summarization. The experimental results are listed in Table 3. All results in the table are averages over the ten-fold cross validation experiments on our dataset.
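ROUGE precision, recall and F1 can be computed per reference/system summary pair; the sketch below uses the `rouge-score` package as an assumed implementation (the paper does not say which ROUGE tool was used) with toy strings.

```python
from rouge_score import rouge_scorer

# Hypothetical reference (manual) and system-generated summaries.
reference = "replace the battery and reset the device"
generated = "reset the device and replace the battery"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)

for name, s in scores.items():
    # Each entry carries the precision, recall and F1 reported in Table 3.
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```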

Model        R1 P    R1 R    R1 F1   R2 P    R2 R    R2 F1   RL P    RL R    RL F1
SVM          79.2%   52.5%   63.1%   71.9%   41.3%   52.4%   67.1%   36.7%   47.4%
LR           75.2%   57.4%   65.1%   66.1%   48.5%   56.0%   61.6%   43.2%   50.8%
LCRF         78.7%-  61.8%   69.3%-  71.4%-  54.1%   61.6%   67.1%-  49.6%   57.0%
gCRF         81.9%   65.2%   72.6%   76.8%   57.3%   65.7%   73.9%   53.5%   62.1%
gCRF-QS      81.4%-  70.0%   75.3%   76.2%-  62.4%   68.6%   73.3%-  58.6%   65.1%
gCRF-QS-l1   86.6%   68.3%-  76.4%   82.6%   61.5%-  70.5%   80.4%   58.2%-  67.5%

Table 3: The Precision, Recall and F1 of ROUGE-1, ROUGE-2 and ROUGE-L for the baselines (SVM, LR, LCRF) and our general CRF-based models (gCRF, gCRF-QS, gCRF-QS-l1). The down-arrow means performance degradation with statistical significance.

It is observed that our gCRF-QS-l1 model improves the performance in terms of precision, recall and F1 score on all three measurements of ROUGE-1, ROUGE-2 and ROUGE-L.