
In this measure the conditional probability P(Asp|Qsp) is substituted at the place of the joint probability used by the symmetric measures, so that the score considers Asp only in the cases when Qsp is present. The statistical evidence for this can be measured through hits(Qsp NEAR Asp); however, this value is corrected with hits(Asp)^{2/3} in the denominator, to avoid the cases when high-frequency words and patterns are taken as relevant answers.

For CCP we obtain:

$$CCP(Q_{sp}, A_{sp}) = \frac{hits(Q_{sp}\ \mathrm{NEAR}\ A_{sp}) \cdot MaxPages^{2/3}}{hits(Q_{sp}) \cdot hits(A_{sp})^{2/3}}$$
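As an illustration, this estimate can be computed directly from raw hit counts. The following minimal sketch assumes a MaxPages value and example counts that are invented placeholders (actual AltaVista counts are not reproducible), and a hypothetical ccp_score helper:

# CCP sketch: answer validity score from web hit counts.
# MAX_PAGES approximates the number of pages indexed by the search
# engine; the value below is an assumed placeholder.
MAX_PAGES = 3_000_000_000

def ccp_score(hits_qsp: int, hits_asp: int, hits_near: int) -> float:
    # CCP(Qsp, Asp) = hits(Qsp NEAR Asp) * MaxPages^(2/3)
    #                 / (hits(Qsp) * hits(Asp)^(2/3))
    if hits_qsp == 0 or hits_asp == 0:
        return 0.0  # no evidence for the pattern or the answer
    return hits_near * MAX_PAGES ** (2 / 3) / (hits_qsp * hits_asp ** (2 / 3))

# Invented counts, for illustration only:
print(ccp_score(hits_qsp=12_000, hits_asp=900_000, hits_near=250))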

Applying this estimation, an answer validity score of 55.5 is calculated. It turns out that this value is the maximal validity score for all the answers of this question. Other correct answers from the TREC-2001 collection contain as named entity “Mississippi”. Their answer validity score is 11.8, which is greater than 1.2 and also greater than the relative threshold derived from the maximal validity score (see Section 5). This score (i.e. 11.8) classifies them as relevant answers. On the other hand, all the wrong answers have validity scores below 1, and as a result all of them are classified as irrelevant answer candidates.

5 Experiments and Discussion

A number of experiments have been carried out in order to check the validity of the proposed answer validation technique. As a data set, the 492 questions of the TREC-2001 database have been used. For each question, at most three correct answers and three wrong answers have been randomly selected from the TREC-2001 participants' submissions, resulting in a corpus of 2726 question-answer pairs (some questions have fewer than three positive answers in the corpus). As said before, AltaVista was used as the search engine.

A baseline for the answer validation experiment was defined by considering how often an answer occurs in the top 10 documents among those (1000 for each question) provided by NIST to TREC-2001 participants. An answer was judged correct for a question if it appears at least once in the first 10 documents retrieved for that question; otherwise it was judged not correct. Baseline results are reported in Table 2.
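As a minimal sketch, the baseline decision reduces to a membership test over the top-ranked documents; the string-containment criterion and the helper names below are assumptions, since the exact matching procedure is not spelled out here:

# Baseline sketch: an answer is judged correct if it occurs at least
# once in the first 10 of the 1000 NIST-provided documents.
def contains_answer(document: str, answer: str) -> bool:
    # Assumed criterion: simple case-insensitive substring match.
    return answer.lower() in document.lower()

def baseline_judgment(ranked_docs: list[str], answer: str, k: int = 10) -> bool:
    return any(contains_answer(doc, answer) for doc in ranked_docs[:k])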

We carried out several experiments in order to check a number of working hypotheses. Three independent factors were considered:

Estimation method. We have implemented three measures (reported in Section 4.2) to estimate an answer validity score: PMI, MLHR and CCP.

Threshold. We wanted to estimate the role of two different kinds of thresholds for the assessment of answer validation. In the case of an absolute threshold, if the answer validity score for a candidate answer is below the threshold, the answer is considered wrong; otherwise it is accepted as relevant. In a second type of experiment, for every question and its corresponding answers the program chooses the answer with the highest validity score and calculates a relative threshold on that basis (i.e. $T_{rel} = c \cdot AVS_{max}$, with $0 < c < 1$). However, the relative threshold should be larger than a certain minimum value.
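The two regimes can be sketched as follows; the fraction and the minimum value are assumptions chosen to be consistent with the example of Section 4.3, not settings reported here:

# Threshold sketch. REL_FRACTION (c) and MIN_THRESHOLD are assumed
# values; the text only states that the relative threshold is derived
# from the maximal validity score and must exceed a certain minimum.
REL_FRACTION = 0.2
MIN_THRESHOLD = 1.2

def accept_absolute(score: float, threshold: float) -> bool:
    # Absolute regime: accept the answer iff its score reaches the threshold.
    return score >= threshold

def accept_relative(scores: list[float]) -> list[bool]:
    # Relative regime: the threshold is a fraction of the best score,
    # floored at MIN_THRESHOLD.
    t_rel = max(REL_FRACTION * max(scores), MIN_THRESHOLD)
    return [s >= t_rel for s in scores]

# With the scores from the Section 4.3 example (55.5, 11.8, and a wrong
# answer below 1), the relative threshold is 11.1 and only the first
# two answers are accepted:
print(accept_relative([55.5, 11.8, 0.4]))  # [True, True, False]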

Question type. We wanted to check performance variation based on different types of TREC-2001 questions. In particular, we have separated definition and generic questions from true named entity questions.

Tables 2 and 3 report the results of the automatic answer validation experiments, obtained respectively on all the TREC-2001 questions and on the subset of named entity questions. For each estimation method we report precision, recall and success rate. Success rate best represents the performance of the system, being the percent of [question, answer] pairs where the result given by the system is the same as the TREC judges' opinion. Precision is the percent of [question, answer] pairs estimated by the algorithm as relevant, for which the opinion of the TREC judges was the same. Recall shows the percent of the relevant answers which the system also evaluates as relevant.
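The three figures can be computed directly from the system's binary decisions paired with the TREC judgments; this per-pair representation is a simplifying assumption for illustration:

# Metrics sketch: each pair holds (system_says_relevant, judge_says_relevant).
def metrics(pairs: list[tuple[bool, bool]]) -> tuple[float, float, float]:
    tp = sum(1 for s, j in pairs if s and j)    # both mark the answer relevant
    system_pos = sum(1 for s, _ in pairs if s)  # system marks it relevant
    judge_pos = sum(1 for _, j in pairs if j)   # judges mark it relevant
    agree = sum(1 for s, j in pairs if s == j)  # system agrees with judges
    precision = tp / system_pos if system_pos else 0.0
    recall = tp / judge_pos if judge_pos else 0.0
    success_rate = agree / len(pairs)
    return precision, recall, success_rate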

              P (%)   R (%)   SR (%)
Baseline      50.86    4.49   52.99
CCP - rel.    77.85   82.60   81.25
CCP - abs.    74.12   81.31   78.42
PMI - rel.    77.40   78.27   79.56
PMI - abs.    70.95   87.17   77.79
MLHR - rel.   81.23   72.40   79.60
MLHR - abs.   72.80   80.80   77.40

Table 2: Results on all 492 TREC-2001 questions

              P (%)   R (%)   SR (%)
CCP - rel.    85.12   84.27   86.38
CCP - abs.    83.07   78.81   83.35
PMI - rel.    83.78   82.12   84.90
PMI - abs.    79.56   84.44   83.35
MLHR - rel.   90.65   72.75   84.44
MLHR - abs.   87.20   67.20   82.10

Table 3: Results on 249 named entity questions

The best results on the 492 questions corpus (CCP measure with relative threshold) show a success rate of 81.25%, i.e. in 81.25% of the pairs the system evaluation corresponds to the human evaluation, which confirms the initial working hypotheses. This is 28% above the baseline success rate. Precision and recall are respectively 20-30% and 68-87% above the baseline values. These results demonstrate that the intuition behind the approach is motivated and that the algorithm provides a workable solution for answer validation.

The experiments show that the average difference between the success rates obtained for the named entity questions (Table 3) and the full TREC-2001 question set (Table 2) is 5.1%. This means that our approach performs better when the answer entities

...precision of a QA system by 25-30%. A common feature of these approaches is the use of the Web to introduce data redundancy for a more reliable answer extraction from local text collections. (Radev et al.,