
Clinical Evidence to create a test collection for system evaluation. We randomly selected thirty diseases, generating a development set of five questions and a test set of twenty-five questions. Some examples include: acute asthma, chronic prostatitis, community acquired pneumonia, and erectile dysfunction. CE listed an average of 11.3 interventions per disease; of those, 2.3 on average were marked as beneficial and 1.9 as likely beneficial. On average, there were 48.4 references associated with each disease, representing the articles consulted during the compilation of CE itself. Of those, 34.7 citations on average appeared in MEDLINE; we gathered all these abstracts, which serve as the reference summaries for our ROUGE-based automatic evaluation.

Since the focus of our work is not on retrieval algorithms per se, we employed PubMed to fetch an initial set of MEDLINE citations and performed answer synthesis using those results. The PubMed citations also serve as a baseline, since they represent a system commonly used by physicians.

In order to obtain the best possible set of citations, the first author (an experienced PubMed searcher) manually formulated queries, taking advantage of MeSH (Medical Subject Headings) terms when available. MeSH terms are controlled vocabulary concepts assigned manually by trained medical indexers (based on the full text of the articles).

6.1 Manual Evaluation of Short Answers

In our manual evaluation, system outputs were assessed as if they were answers to factoid questions. We gathered three different sets of answers. For the baseline, we used the main intervention from each of the first three PubMed citations. For our test condition, we considered the three largest clusters, taking the main intervention from the first abstract in each cluster. This yields three drugs that are at the same level of ontological granularity as those extracted from the unclustered PubMed citations. For our third condition, we assumed the existence of an oracle which selects the three best clusters (as determined by the first author, a medical doctor). From each of these three clusters, we extracted the main intervention of the first abstract. This oracle condition represents an achievable upper bound with a human in the loop. Physicians are highly trained professionals who already have significant domain knowledge. Faced with a small number of choices, it is likely that they will be able to select the most promising cluster, even if they did not previously know it.

This preparation yielded up to nine drug names, three from each experimental condition. For brevity, we refer to these as PubMed, Cluster, and Oracle, respectively. After blinding the source of the drugs and removing duplicates, each short answer was presented to the first author for evaluation.
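
To make the three experimental conditions concrete, the sketch below shows one way the three answer sets could be assembled; the data structures and the main_intervention() helper are hypothetical stand-ins for illustration, not the authors' implementation.

# Illustrative sketch (hypothetical data structures, not the authors' code):
# assembling the PubMed, Cluster, and Oracle short-answer sets described above.

def main_intervention(abstract):
    # Stand-in for the knowledge extractor: here we assume the main
    # intervention has already been identified and stored with the abstract.
    return abstract["intervention"]

def build_answer_sets(pubmed_abstracts, clusters, oracle_indices):
    # pubmed_abstracts: PubMed citations in retrieval order (list of dicts).
    # clusters: list of clusters, each a ranked list of abstracts.
    # oracle_indices: indices of the three best clusters, chosen by a physician.

    # Baseline: main intervention from each of the first three PubMed citations.
    pubmed = [main_intervention(a) for a in pubmed_abstracts[:3]]

    # Test condition: main intervention from the first abstract of each of
    # the three largest clusters.
    largest = sorted(clusters, key=len, reverse=True)[:3]
    cluster = [main_intervention(c[0]) for c in largest]

    # Oracle condition: same extraction, but over the clusters judged best.
    oracle = [main_intervention(clusters[i][0]) for i in oracle_indices[:3]]

    return pubmed, cluster, oracle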

          Clinical Evidence                                   Physician
          B      LB     T      U      UB     H      N        Good   Okay   Bad
PubMed    0.200  0.213  0.160  0.053  0.000  0.013  0.360    0.600  0.227  0.173
Cluster   0.387  0.173  0.173  0.027  0.000  0.000  0.240    0.827  0.133  0.040
Oracle    0.400  0.200  0.133  0.093  0.013  0.000  0.160    0.893  0.093  0.013

Table 2: Manual evaluation of short answers: distribution of system answers with respect to CE categories (left side) and with respect to the assessor's own expertise (right side). (Key: B=beneficial, LB=likely beneficial, T=tradeoffs, U=unknown, UB=unlikely beneficial, H=harmful, N=not in CE)