Section 3.3). From now on we will refer to this second version of the dataset as the “filtered version”.

3.2 Quality assessing

In Section 2.1 we claimed to be able to identify high-quality content. To demonstrate this, we conducted a set of experiments on the original unfiltered dataset to establish whether the feature space Ψ was powerful enough to capture the quality of answers; our specific objective was to estimate the [...]

[...] answers as the size of Tr_Q increases (X-axis). The experiment started from a training set of size 100 and was repeated adding 300 examples at a time until precision started decreasing. With each increase in training set size, the experiment was repeated ten times and average precision values were calculated. In all runs, training examples were picked randomly from the unfiltered dataset described in Section 3.1; for details on Tr_Q see Section 2.1. A training set of 12,000 examples was chosen for the summarization experiments.
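The training-set sizing procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the Weka setup used in the experiments: `learning_curve`, its parameter names, and the `evaluate` callback (which would train a quality classifier on the sampled examples and return its precision) are all hypothetical.

```python
import random
from statistics import mean


def learning_curve(pool, evaluate, start=100, step=300, repeats=10, seed=0):
    """Grow the training set until mean precision starts decreasing.

    pool     -- list of labelled examples to sample from (assumed)
    evaluate -- hypothetical callback: training sample -> precision
    Returns a list of (training-set size, mean precision) pairs.
    """
    rng = random.Random(seed)
    curve = []
    size = start
    while size <= len(pool):
        # Repeat the run on fresh random samples and average the precision.
        avg = mean(evaluate(rng.sample(pool, size)) for _ in range(repeats))
        curve.append((size, avg))
        # Stop as soon as mean precision falls below the previous point.
        if len(curve) > 1 and avg < curve[-2][1]:
            break
        size += step
    return curve


if __name__ == "__main__":
    # Toy stand-in for a real classifier: precision peaks at 700 examples.
    toy_pool = list(range(2000))
    for size, precision in learning_curve(
        toy_pool, lambda sample: 1 - abs(len(sample) - 700) / 1000
    ):
        print(size, round(precision, 3))
```

The size just before the drop is the natural candidate for the final training set; the sketch also stops when the pool is exhausted, a case the text does not discuss.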

⁸ Being too easy to summarize or not requiring any summarization at all, those questions wouldn’t constitute a valuable test of the system’s ability to extract information.

⁹ Performed with Weka 3.7.0, available at http://www.cs.waikato.ac.nz/~ml/weka

System       a* (baseline)   S_Σ      S_Π
ROUGE-1 R    51.7%           67.3%    67.4%
ROUGE-1 P    62.2%           54.0%    71.2%
ROUGE-1 F    52.9%           59.3%    66.1%
ROUGE-2 R    40.5%           52.2%    58.8%
ROUGE-2 P    49.0%           41.4%    63.1%
ROUGE-2 F    41.6%           45.9%    57.9%
ROUGE-L R    50.3%           65.1%    66.3%
ROUGE-L P    60.5%           52.3%    70.7%
ROUGE-L F    51.5%           57.3%    65.1%
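To make the R, P and F rows of the table above concrete, the following is a minimal sketch of ROUGE-N against a single reference summary. It is an illustration only: the function names are hypothetical, tokenization is plain whitespace splitting, and it does not reproduce the official ROUGE toolkit (no stemming, no multiple references, no ROUGE-L).

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f1) of n-gram overlap for one summary pair."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count, i.e. clipped matches.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    if overlap == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


if __name__ == "__main__":
    print(rouge_n("the cat", "the cat sat", n=1))
```

Note that the F values in the table are not the harmonic mean of the R and P values next to them (for a*, 51.7% and 62.2% would give about 56.5%, not 52.9%). This would be consistent with each score being averaged over questions separately, but the text here does not say so; treat that reading as an assumption.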

Table 1: Summarization evaluation on the filtered dataset (refer to Section 3.1 for details). ROUGE-L, ROUGE-1 and ROUGE-2 are presented; for each, Recall (R), Precision (P) and F-1 score (F) are given.

Figure 2: Increase in ROUGE-L, ROUGE-1 and ROUGE-2 performance of the S_Π system as more measures are taken into consideration in the scoring function, starting from Relevance alone (R) to the complete system (RQNC). F-1 scores are given.

3.3 Evaluating answer summaries

The objective of our work was to summarize answers from cQA portals. Two systems were designed: Table 1 shows the performance using function S_Σ (see equation (7)) and function S_Π (see equation (6)). The chosen best answer a* was used as a baseline. We calculated ROUGE-1 and ROUGE-2 scores¹⁰ against human annotation on the filtered version of the dataset presented in

[...] from the enforcement of a more stringent length constraint than the one proposed in (8). Further potential improvements on S_Σ could be obtained by choosing a classifier able to learn a more expressive underlying function.

In order to determine what influence the single measures had on the overall performance, we con-