the variable incorrect is increased by 1. After the evaluation process is finished, the final version of the pattern given as an example in Figure 1 is now:

Query: When[1]+was[2]+NP[3]+VERB[4]
Path 3: ⇑pobj⇑prep⇑nsubj⇓prep⇓pobj
Path 4: ⇑nsubj⇓prep⇓pobj
Correct: 15
Incorrect: 4

The variables correct and incorrect are used during retrieval, where the score of an answer candidate ac is the sum of the scores of all matching patterns p:
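A natural form for this score, assuming each pattern p is weighted by its precision as estimated from its correct and incorrect counts, is:

\[
\mathrm{score}(ac) \;=\; \sum_{p \,\in\, P(ac)} \frac{\mathit{correct}_p}{\mathit{correct}_p + \mathit{incorrect}_p}
\]

where P(ac) denotes the set of patterns matching the candidate ac. Under this reading, the example pattern above would contribute 15/(15 + 4) ≈ 0.79 to any candidate it matches.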

7.1 Evaluation Setup

For evaluation we use all factoid questions in TREC’s QA test sets from 2002 to 2006 for which a known answer exists in the AQUAINT corpus. Additionally, the data from (Lin and Katz, 2005) is used; in that paper the authors attempt to identify a much more complete set of relevant documents for a subset of the TREC 2002 questions than TREC itself did. We adopt a cross-validation approach for our evaluation. Table 4 shows how the data is split into five folds.

In order to evaluate the algorithm’s patterns we need a set of sentences to which they can be applied. In a traditional QA system architecture, see e.g. (Prager, 2006; Voorhees, 2003), the document or passage retrieval step performs this function. This step is crucial to a QA system’s performance, because it is impossible to locate answers in passages that were never retrieved.

In order to provide a quantitative characterization of the two evaluation sets, we estimated the number of correct answer sentences they contain. For each paragraph it was determined whether it contained one of the known answer strings and at least one of the question key words. Tables 2 and 3 show for each evaluation set how many answers on average it contains per question. The column “= 0”, for example, shows the fraction of questions for which no valid answer sentence is contained in the evaluation set, while the column “>= 90” gives the fraction of questions with 90 or more valid answer sentences. The last two columns show mean and median values.
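As a concrete reading of this approximation, the following minimal Python sketch (helper names and the sentence splitter are illustrative, not taken from the paper) counts a question’s answer sentences in an evaluation set:

import re

def contains_answer(text, answer_strings, question_keywords):
    """Approximation used for Tables 2 and 3: a text unit counts as
    answer-bearing if it contains one of the known answer strings AND
    at least one of the question's key words."""
    lowered = text.lower()
    return (any(a.lower() in lowered for a in answer_strings)
            and any(k.lower() in lowered for k in question_keywords))

def count_answer_sentences(paragraphs, answer_strings, question_keywords):
    """Estimate the number of correct answer sentences a question has in
    an evaluation set (naive sentence splitting; the paper's exact
    tokenization is not specified)."""
    n = 0
    for paragraph in paragraphs:
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            if contains_answer(sentence, answer_strings, question_keywords):
                n += 1
    return n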

                          Number of Correct Answer Sentences
Test set |   = 0 |  <= 1 |  <= 3 |  <= 5 | <= 10 | <= 25 | <= 50 | >= 75 | >= 90 | >= 100 | Mean | Med
2002     | 0.203 | 0.396 | 0.580 | 0.671 | 0.809 | 0.935 | 0.984 |   0.0 |   0.0 |    0.0 | 6.86 | 2.0
2003     | 0.249 | 0.429 | 0.627 | 0.732 | 0.828 | 0.955 | 0.997 | 0.003 | 0.003 |    0.0 | 5.67 | 2.0
2004     | 0.221 | 0.368 | 0.539 | 0.637 | 0.799 | 0.936 | 0.985 |   0.0 |   0.0 |    0.0 | 6.51 | 3.0
2005     | 0.245 | 0.404 | 0.574 | 0.665 | 0.777 | 0.912 | 0.987 |   0.0 |   0.0 |    0.0 | 7.56 | 2.0
2006     | 0.241 | 0.389 | 0.568 | 0.665 | 0.807 | 0.920 | 0.966 | 0.006 |   0.0 |    0.0 | 8.04 | 3.0

Table 2: Fraction of sentences that contain correct answers in Evaluation Set 1 (approximation).

                          Number of Correct Answer Sentences
Test set |   = 0 |  <= 1 |  <= 3 |  <= 5 | <= 10 | <= 25 | <= 50 | >= 75 | >= 90 | >= 100 |  Mean |  Med
2002     |   0.0 | 0.074 | 0.158 | 0.235 | 0.342 | 0.561 | 0.748 | 0.172 | 0.116 |  0.060 | 33.46 | 21.0
2003     |   0.0 | 0.099 | 0.203 | 0.254 | 0.356 | 0.573 | 0.720 | 0.161 | 0.090 |  0.031 | 32.88 | 19.0
2004     |   0.0 | 0.073 | 0.137 | 0.211 | 0.328 | 0.598 | 0.779 | 0.142 | 0.069 |  0.034 | 30.82 | 20.0
2005     |   0.0 | 0.163 | 0.238 | 0.279 | 0.410 | 0.589 | 0.759 | 0.141 | 0.097 |  0.069 | 30.87 | 17.0
2006     |   0.0 | 0.125 | 0.207 | 0.281 | 0.415 | 0.596 | 0.727 | 0.173 | 0.122 |  0.088 | 32.93 | 17.5

Table 3: Fraction of sentences that contain correct answers in Evaluation Set 2 (approximation).

Fold | Training sets used        |    # | Test set |    #
1    | T03, T04, T05, T06        | 4565 | T02      | 1159
2    | T02, T04, T05, T06, Lin02 | 6174 | T03      | 1352
3    | T02, T03, T05, T06, Lin02 | 6700 | T04      |  826
4    | T02, T03, T04, T06, Lin02 | 6298 | T05      | 1228
5    | T02, T03, T04, T05, Lin02 | 6367 | T06      | 1159

Table 4: Splits into training and test sets of the data used for evaluation. T02 stands for TREC 2002 data etc. Lin02 is based on (Lin and Katz, 2005). The # columns show how many question/answer sentence pairs are used for training and for testing.
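The fold construction in Table 4 amounts to a leave-one-TREC-year-out split in which Lin02 is used for training only. A minimal sketch, assuming the data sets are available as in-memory lists of question/answer sentence pairs (names hypothetical), is:

def make_folds(trec_sets, lin02):
    """Build the five leave-one-year-out folds of Table 4.

    trec_sets: dict mapping e.g. "T02" to that year's question/answer
    sentence pairs; lin02: the extra pairs based on (Lin and Katz, 2005).
    """
    folds = []
    for held_out in sorted(trec_sets):
        train = [pair for name, pairs in trec_sets.items()
                 if name != held_out for pair in pairs]
        # Lin02 is derived from TREC 2002 questions, so Table 4 leaves it
        # out of the training data only when T02 itself is the test set.
        if held_out != "T02":
            train += lin02
        folds.append((train, trec_sets[held_out]))
    return folds

Given the same underlying data, this reproduces the training/test pair counts listed in the # columns of Table 4.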