SECTION 3.3. A BENEFIT, SINCE IT GIVES MORE OPPORTUNITY FOR ENFORC-

2003), there was an evaluation of responses according to systems’
confidences in their own answers, using the Average Precision (AP)
metric. This is an important consideration, since it is generally
better for a system to say “I don’t know” than to give a wrong answer.
On the TREC12 questions set, our AP score increased 2.1% with
Constraints, using the algorithm we presented in (Chu-Carroll et al.
2002).

dent Marcos”; “French” vs. “France”), and fails to combine them, it is
observed that as long as either one is in first place then the
question is correct and might not attract more attention from
developers. It is only when neither is initially in first place, but
combining the scores of correct candidates boosts one to first place
that the failure to merge them is relevant. However, in the context of
our system, we are comparing the pivot term from the original ques-