the variable incorrect is increased by 1. After the evaluation process is finished, the final version of the pattern given as an example in Figure 1 is now:

Query: When[1]+was[2]+NP[3]+VERB[4]
Path 3: ⇑pobj⇑prep⇑nsubj⇓prep⇓pobj
Path 4: ⇑nsubj⇓prep⇓pobj
Correct: 15
Incorrect: 4

The variables correct and incorrect are used during retrieval, where the score of an answer candidate ac is the sum of the scores of all matching patterns p:

7.1 Evaluation Setup

We evaluate on all factoid questions in TREC’s QA test sets from 2002 to 2006 for which a known answer exists in the AQUAINT corpus. Additionally, the data in (Lin and Katz, 2005) is used. In that paper, the authors attempt to identify a much more complete set of relevant documents for a subset of TREC 2002 questions than TREC itself provides. We adopt a cross validation approach for our evaluation. Table 4 shows how the data is split into five folds.

In order to evaluate the algorithm’s patterns we need a set of sentences to which they can be applied. In a traditional QA system architecture,
Test  Number of Correct Answer Sentences
set   = 0    <= 1   <= 3   <= 5   <= 10  <= 25  <= 50  >= 75  >= 90  >= 100 Mean   Med
2002  0.203  0.396  0.580  0.671  0.809  0.935  0.984  0.0    0.0    0.0    6.86   2.0
2003  0.249  0.429  0.627  0.732  0.828  0.955  0.997  0.003  0.003  0.0    5.67   2.0
2004  0.221  0.368  0.539  0.637  0.799  0.936  0.985  0.0    0.0    0.0    6.51   3.0
2005  0.245  0.404  0.574  0.665  0.777  0.912  0.987  0.0    0.0    0.0    7.56   2.0
2006  0.241  0.389  0.568  0.665  0.807  0.920  0.966  0.006  0.0    0.0    8.04   3.0

Table 2: Fraction of sentences that contain correct answers in Evaluation Set 1 (approximation).

Test  Number of Correct Answer Sentences
set   = 0    <= 1   <= 3   <= 5   <= 10  <= 25  <= 50  >= 75  >= 90  >= 100 Mean   Med
2002  0.0    0.074  0.158  0.235  0.342  0.561  0.748  0.172  0.116  0.060  33.46  21.0
2003  0.0    0.099  0.203  0.254  0.356  0.573  0.720  0.161  0.090  0.031  32.88  19.0
2004  0.0    0.073  0.137  0.211  0.328  0.598  0.779  0.142  0.069  0.034  30.82  20.0
2005  0.0    0.163  0.238  0.279  0.410  0.589  0.759  0.141  0.097  0.069  30.87  17.0
2006  0.0    0.125  0.207  0.281  0.415  0.596  0.727  0.173  0.122  0.088  32.93  17.5

Table 3: Fraction of sentences that contain correct answers in Evaluation Set 2 (approximation).

In order to provide a quantitative characterization of the two evaluation sets we estimated the number of correct answer sentences they contain. For each paragraph it was determined whether it contained one of the known answer strings and

Fold  Training Data                    Test Data
      sets used                  #     set  #
1     T03, T04, T05, T06         4565  T02  1159
2     T02, T04, T05, T06, Lin02  6174  T03  1352
3     T02, T03, T05, T06, Lin02  6700  T04  826
4     T02, T03, T04, T06, Lin02  6298  T05  1228
5     T02, T03, T04, T05, Lin02  6367  T06  1159
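
The five folds listed above follow a leave-one-test-set-out scheme, which the following minimal Python sketch reproduces. Two points are assumptions rather than statements from the text: the size of Lin02 (1802) is not given anywhere and is inferred here from the fold totals, and the reason Lin02 is left out of fold 1 is presumed to be that it covers TREC 2002 questions, which form the test set of that fold. The names SET_SIZES, TREC_SETS and make_folds are hypothetical.

```python
# Sizes of the TREC sets are the "#" values of the test-data column.
# ASSUMPTION: the Lin02 size is not stated in the text; it is inferred as
# 6174 - (1159 + 826 + 1228 + 1159) = 1802 from the fold 2 training total.
SET_SIZES = {
    "T02": 1159,
    "T03": 1352,
    "T04": 826,
    "T05": 1228,
    "T06": 1159,
    "Lin02": 1802,
}
TREC_SETS = ["T02", "T03", "T04", "T05", "T06"]


def make_folds():
    """Build one fold per TREC set, holding that set out as test data."""
    folds = []
    for test_set in TREC_SETS:
        train_sets = [s for s in TREC_SETS if s != test_set]
        # ASSUMPTION: Lin02 is based on TREC 2002 questions, so it is only
        # added to the training data when T02 is not the held-out set.
        if test_set != "T02":
            train_sets.append("Lin02")
        folds.append({
            "train": train_sets,
            "train_size": sum(SET_SIZES[s] for s in train_sets),
            "test": test_set,
            "test_size": SET_SIZES[test_set],
        })
    return folds


if __name__ == "__main__":
    for number, fold in enumerate(make_folds(), start=1):
        print(number, ", ".join(fold["train"]), fold["train_size"],
              fold["test"], fold["test_size"])
```

Under these assumptions the computed training totals match the "#" column of the training data for all five folds, which is a useful consistency check on the reconstructed table.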