
... Section 3.3). From now on we will refer to this second version of the dataset as the “filtered version”.

3.2 Quality assessing

In Section 2.1 we claimed to be able to identify high quality content. To demonstrate it, we conducted a set of experiments on the original unfiltered dataset to establish whether the feature space Ψ was powerful enough to capture the quality of answers; our specific objective was to estimate the ...

... answers as the size of Tr^Q increases (X-axis). The experiment started from a training set of size 100 and was repeated adding 300 examples at a time until precision started decreasing. With each increase in training set size, the experiment was repeated ten times and average precision values were calculated. In all runs, training examples were picked randomly from the unfiltered dataset described in Section 3.1; for details on Tr^Q see Section 2.1. A training set of 12,000 examples was chosen for the summarization experiments.
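The incremental protocol described above (start at 100 training examples, add 300 at a time, repeat each size ten times, average the precision, stop once it drops) can be sketched as follows. This is a minimal illustration, not the authors' setup: the `evaluate` callback, which is assumed to train a quality classifier on the sampled examples and return its precision, is hypothetical, and reading "until precision started decreasing" as "stop at the first drop in the averaged value" is an assumption.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def learning_curve(
    pool: Sequence,
    evaluate: Callable[[list], float],
    start: int = 100,
    step: int = 300,
    repeats: int = 10,
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Return (training_size, mean_precision) pairs up to the first drop.

    `evaluate` is a hypothetical callback: it receives a random training
    sample and returns the precision obtained after training on it.
    """
    rng = random.Random(seed)
    curve: List[Tuple[int, float]] = []
    size = start
    while size <= len(pool):
        # Average precision over `repeats` independent random samples.
        avg = mean(evaluate(rng.sample(pool, size)) for _ in range(repeats))
        if curve and avg < curve[-1][1]:
            break  # averaged precision decreased: stop growing the set
        curve.append((size, avg))
        size += step
    return curve


if __name__ == "__main__":
    # Toy stand-in for a real classifier: precision depends only on the
    # training-set size and peaks at 700 examples (purely illustrative).
    def toy_precision(train: list) -> float:
        return 1 - abs(len(train) - 700) / 1000

    for size, precision in learning_curve(list(range(2000)), toy_precision):
        print(size, round(precision, 3))
```

The point at which precision first drops is discarded, so the last pair in the returned list is the best size observed under this stopping rule.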

8 Being too easy to summarize or not requiring any summarization at all, those questions wouldn’t constitute a valuable test of the system’s ability to extract information.

9 Performed with Weka 3.7.0, available at cs.waikato.ac.nz/~ml/weka

System       a* (baseline)   S^Σ      S^Π
ROUGE-1 R    51.7%           67.3%    67.4%
ROUGE-1 P    62.2%           54.0%    71.2%
ROUGE-1 F    52.9%           59.3%    66.1%
ROUGE-2 R    40.5%           52.2%    58.8%
ROUGE-2 P    49.0%           41.4%    63.1%
ROUGE-2 F    41.6%           45.9%    57.9%
ROUGE-L R    50.3%           65.1%    66.3%
ROUGE-L P    60.5%           52.3%    70.7%
ROUGE-L F    51.5%           57.3%    65.1%

Table 1: Summarization Evaluation on filtered dataset (refer to Section 3.1 for details). ROUGE-L, ROUGE-1 and ROUGE-2 are presented; for each, Recall (R), Precision (P) and F-1 score (F) are given.

Figure 2: Increase in ROUGE-L, ROUGE-1 and ROUGE-2 performances of the S^Π system as more measures are taken into consideration in the scoring function, starting from Relevance alone (R) to the complete system (RQNC). F-1 scores are given.

3.3 Evaluating answer summaries

The objective of our work was to summarize answers from cQA portals. Two systems were designed: Table 1 shows the performances using function S^Σ (see equation (7)), and function S^Π (see equation (6)). The chosen best answer a* was used as a baseline. We calculated ROUGE-1 and ROUGE-2 scores¹⁰ against human annotation on the filtered version of the dataset presented in ...

... from the enforcement of a more stringent length constraint than the one proposed in (8). Further potential improvements on S^Σ could be obtained by choosing a classifier able to learn a more expressive underlying function.

In order to determine what influence the single measures had on the overall performance, we con-
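For readers unfamiliar with the Recall, Precision and F-1 columns of Table 1, the sketch below computes a simplified ROUGE-N between one candidate summary and one reference. It is an illustration under stated assumptions, not the evaluation pipeline used for the reported numbers: tokenization is plain lowercased whitespace splitting, there is no stemming or stop-word handling, only a single reference is supported, ROUGE-L is not covered, and the function names `ngrams` and `rouge_n` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Simplified ROUGE-N recall, precision and F-1 for a single reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram, so a
    # repeated n-gram is only credited as often as the reference has it.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"R": recall, "P": precision, "F": f1}


if __name__ == "__main__":
    summary = "the cat sat"
    annotation = "the cat sat on the mat"
    print(rouge_n(summary, annotation, n=1))
    print(rouge_n(summary, annotation, n=2))
```

A short summary that copies part of the reference scores high precision and low recall, which is the trade-off the R and P rows of Table 1 expose for each system.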

You are viewing section 3 of the scientific report "Metadata Aware Measures for Answer Summarization in Community Question Answering" (PDF).
