the variable incorrect is increased by 1. After the evaluation process is finished, the final version of the pattern given as an example in Figure 1 is now:

Query: When[1]+was[2]+NP[3]+VERB[4]
Path 3: ⇑pobj⇑prep⇑nsubj⇓prep⇓pobj
Path 4: ⇑nsubj⇓prep⇓pobj
Correct: 15
Incorrect: 4

The variables correct and incorrect are used during retrieval, where the score of an answer candidate ac is the sum of the scores of all matching patterns p:
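A natural form for this score, assuming each pattern p is weighted by its precision as estimated from its correct and incorrect counts, is:

\[
\mathrm{score}(ac) \;=\; \sum_{p \,\in\, P(ac)} \frac{\mathit{correct}_p}{\mathit{correct}_p + \mathit{incorrect}_p}
\]

where P(ac) denotes the set of patterns matching the candidate ac. Under this reading, the example pattern above would contribute 15/(15 + 4) ≈ 0.79 to any candidate it matches.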

7.1 Evaluation Setup

For evaluation we use all factoid questions in TREC’s QA test sets from 2002 to 2006 for which a known answer exists in the AQUAINT corpus. Additionally, the data from (Lin and Katz, 2005) is used; in that paper the authors attempt to identify a much more complete set of relevant documents for a subset of the TREC 2002 questions than TREC itself did. We adopt a cross-validation approach for our evaluation. Table 4 shows how the data is split into five folds.

In order to evaluate the algorithm’s patterns we need a set of sentences to which they can be applied. In a traditional QA system architecture, see e.g. (Prager, 2006; Voorhees, 2003), the document or passage retrieval step performs this function. This step is crucial to a QA system’s performance, because it is impossible to locate answers in passages that were never retrieved.

In order to provide a quantitative characterization of the two evaluation sets, we estimated the number of correct answer sentences they contain. For each paragraph it was determined whether it contained one of the known answer strings and at least one of the question key words. Tables 2 and 3 show for each evaluation set how many answers on average it contains per question. The column “= 0”, for example, shows the fraction of questions for which no valid answer sentence is contained in the evaluation set, while the column “>= 90” gives the fraction of questions with 90 or more valid answer sentences. The last two columns show mean and median values.
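As a concrete reading of this approximation, the following minimal Python sketch (helper names and the sentence splitter are illustrative, not taken from the paper) counts a question’s answer sentences in an evaluation set:

import re

def contains_answer(text, answer_strings, question_keywords):
    """Approximation used for Tables 2 and 3: a text unit counts as
    answer-bearing if it contains one of the known answer strings AND
    at least one of the question's key words."""
    lowered = text.lower()
    return (any(a.lower() in lowered for a in answer_strings)
            and any(k.lower() in lowered for k in question_keywords))

def count_answer_sentences(paragraphs, answer_strings, question_keywords):
    """Estimate the number of correct answer sentences a question has in
    an evaluation set (naive sentence splitting; the paper's exact
    tokenization is not specified)."""
    n = 0
    for paragraph in paragraphs:
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            if contains_answer(sentence, answer_strings, question_keywords):
                n += 1
    return n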

                          Number of Correct Answer Sentences
Test set |   = 0 |  <= 1 |  <= 3 |  <= 5 | <= 10 | <= 25 | <= 50 | >= 75 | >= 90 | >= 100 | Mean | Med
2002     | 0.203 | 0.396 | 0.580 | 0.671 | 0.809 | 0.935 | 0.984 |   0.0 |   0.0 |    0.0 | 6.86 | 2.0
2003     | 0.249 | 0.429 | 0.627 | 0.732 | 0.828 | 0.955 | 0.997 | 0.003 | 0.003 |    0.0 | 5.67 | 2.0
2004     | 0.221 | 0.368 | 0.539 | 0.637 | 0.799 | 0.936 | 0.985 |   0.0 |   0.0 |    0.0 | 6.51 | 3.0
2005     | 0.245 | 0.404 | 0.574 | 0.665 | 0.777 | 0.912 | 0.987 |   0.0 |   0.0 |    0.0 | 7.56 | 2.0
2006     | 0.241 | 0.389 | 0.568 | 0.665 | 0.807 | 0.920 | 0.966 | 0.006 |   0.0 |    0.0 | 8.04 | 3.0

Table 2: Fraction of sentences that contain correct answers in Evaluation Set 1 (approximation).

                          Number of Correct Answer Sentences
Test set |   = 0 |  <= 1 |  <= 3 |  <= 5 | <= 10 | <= 25 | <= 50 | >= 75 | >= 90 | >= 100 |  Mean |  Med
2002     |   0.0 | 0.074 | 0.158 | 0.235 | 0.342 | 0.561 | 0.748 | 0.172 | 0.116 |  0.060 | 33.46 | 21.0
2003     |   0.0 | 0.099 | 0.203 | 0.254 | 0.356 | 0.573 | 0.720 | 0.161 | 0.090 |  0.031 | 32.88 | 19.0
2004     |   0.0 | 0.073 | 0.137 | 0.211 | 0.328 | 0.598 | 0.779 | 0.142 | 0.069 |  0.034 | 30.82 | 20.0
2005     |   0.0 | 0.163 | 0.238 | 0.279 | 0.410 | 0.589 | 0.759 | 0.141 | 0.097 |  0.069 | 30.87 | 17.0
2006     |   0.0 | 0.125 | 0.207 | 0.281 | 0.415 | 0.596 | 0.727 | 0.173 | 0.122 |  0.088 | 32.93 | 17.5

Table 3: Fraction of sentences that contain correct answers in Evaluation Set 2 (approximation).

Fold | Training sets used        |    # | Test set |    #
1    | T03, T04, T05, T06        | 4565 | T02      | 1159
2    | T02, T04, T05, T06, Lin02 | 6174 | T03      | 1352
3    | T02, T03, T05, T06, Lin02 | 6700 | T04      |  826
4    | T02, T03, T04, T06, Lin02 | 6298 | T05      | 1228
5    | T02, T03, T04, T05, Lin02 | 6367 | T06      | 1159

Table 4: Splits into training and test sets of the data used for evaluation. T02 stands for TREC 2002 data etc. Lin02 is based on (Lin and Katz, 2005). The # columns show how many question/answer sentence pairs are used for training and for testing.
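The fold construction in Table 4 amounts to a leave-one-TREC-year-out split in which Lin02 is used for training only. A minimal sketch, assuming the data sets are available as in-memory lists of question/answer sentence pairs (names hypothetical), is:

def make_folds(trec_sets, lin02):
    """Build the five leave-one-year-out folds of Table 4.

    trec_sets: dict mapping e.g. "T02" to that year's question/answer
    sentence pairs; lin02: the extra pairs based on (Lin and Katz, 2005).
    """
    folds = []
    for held_out in sorted(trec_sets):
        train = [pair for name, pairs in trec_sets.items()
                 if name != held_out for pair in pairs]
        # Lin02 is derived from TREC 2002 questions, so Table 4 leaves it
        # out of the training data only when T02 itself is the test set.
        if held_out != "T02":
            train += lin02
        folds.append((train, trec_sets[held_out]))
    return folds

Given the same underlying data, this reproduces the training/test pair counts listed in the # columns of Table 4.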