sentence. The “outcome”, another term from evidence-based medicine, asserts the clinical findings of a study, and is typically found towards the end of a MEDLINE abstract. In our case, outcome sentences state the efficacy of a drug in treating a particular disease. Previously, we have built an outcome extractor capable of identifying such sentences in MEDLINE abstracts using supervised machine learning techniques (Demner-Fushman and Lin, 2005; Demner-Fushman and Lin, 2006 in press). Evaluation on a blind held-out test set shows high classification accuracy.

5 Evaluation Methodology

2005). In our application, this is inadequate for many reasons: there is rarely an exact string match between system output and drugs mentioned in CE, primarily due to synonymy (for example, alpha-adrenergic blocking agent and α-blocker refer to the same class of drugs) and ontological mismatch (for example, CE might mention beta-agonists, while a retrieved abstract discusses formoterol, which is a specific representative of beta-agonists). Furthermore, while this evaluation method can tell us if the drugs proposed by the system are “good”, it cannot measure how well the answer is supported by MEDLINE citations; recall that answer justification is important for physicians.
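
The two failure modes of exact string matching described above can be illustrated with a minimal sketch. It is not the evaluation procedure used in this work: the lookup tables SYNONYMS and IS_A and the function names are hypothetical, hand-written stand-ins for a real terminology resource, and they contain only the two examples from the text.

```python
# Hypothetical lookup tables, for illustration only. A real system would
# consult a terminology resource instead of hand-written dictionaries.
SYNONYMS = {
    "α-blocker": "alpha-adrenergic blocking agent",
}
IS_A = {
    "formoterol": "beta-agonists",
}


def normalize(term):
    """Lowercase a term and map it to its preferred name, if one is known."""
    t = term.strip().lower()
    return SYNONYMS.get(t, t)


def exact_match(system_term, reference_term):
    """Plain string comparison, which the text argues is too strict."""
    return system_term == reference_term


def relaxed_match(system_term, reference_term):
    """Return 'exact', 'synonym', 'is-a', or None for a pair of drug names."""
    s, r = normalize(system_term), normalize(reference_term)
    if s == r:
        same_surface = system_term.strip().lower() == reference_term.strip().lower()
        return "exact" if same_surface else "synonym"
    # Walk up the (assumed acyclic) is-a chain from the system term.
    seen = set()
    ancestor = IS_A.get(s)
    while ancestor is not None and ancestor not in seen:
        seen.add(ancestor)
        ancestor = normalize(ancestor)
        if ancestor == r:
            return "is-a"
        ancestor = IS_A.get(ancestor)
    return None


if __name__ == "__main__":
    print(exact_match("α-blocker", "alpha-adrenergic blocking agent"))
    print(relaxed_match("α-blocker", "alpha-adrenergic blocking agent"))
    print(relaxed_match("formoterol", "beta-agonists"))
```

Under these assumptions, the synonym pair fails the exact comparison but is recognized once both names are normalized, and formoterol is linked to beta-agonists only by following the is-a relation. Neither check says anything about whether the retrieved citations support the answer, which is the second limitation noted above.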

The nugget evaluation methodology (Voorhees,

Given that our work draws from QA, IR, and sum-