[...] 2005) developed for scoring answers to complex questions is not suitable for our task, since there is no coherent notion of an “answer text” that the user reads end-to-end. Furthermore, it is unclear what exactly a “nugget” in this case would be. For similar reasons, methodologies for summarization evaluation are also of little help. Typically, system-generated summaries are either evaluated manually by humans (which is expensive and time-consuming) or automatically using a metric such as ROUGE, which compares system output against a number of reference summaries. The interactive nature of our answers violates the assumption that systems’ responses are static text segments. Furthermore, it is unclear what exactly should go into a reference summary, because physicians may want varying amounts of detail depending on familiarity with the disease and patient-specific factors.
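
Since ROUGE figures in the discussion above, the following sketch (our own illustration, not part of the original evaluation) shows the n-gram recall computation at the core of ROUGE-N; the official toolkit adds stemming, stopword handling, and aggregation over multiple reference summaries, and the function name and simplifications here are ours.

    from collections import Counter

    def rouge_n_recall(system: str, reference: str, n: int = 1) -> float:
        # Simplified, single-reference ROUGE-N: the fraction of the
        # reference summary's n-grams that also appear in the system output.
        def ngrams(text: str) -> Counter:
            toks = text.lower().split()
            return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        ref, sys_out = ngrams(reference), ngrams(system)
        if not ref:
            return 0.0
        # Clip matches so a repeated system n-gram is not counted more
        # often than it occurs in the reference.
        overlap = sum(min(count, sys_out[gram]) for gram, count in ref.items())
        return overlap / sum(ref.values())

A static overlap metric of this kind presupposes exactly what our interactive responses lack: a fixed answer text and agreed-upon reference summaries.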

Evaluation methodologies from information retrieval are also inappropriate. User studies have previously been employed to examine the effect of categorized search results. However, they often conflate the effectiveness of the interface with that of the underlying algorithms. For example, Dumais et al. (2001) found significant differences in task performance based on different ways of using purely presentational devices such as mouseovers, [...]

[...]summarization, a proper evaluation that captures the salient characteristics of our system proved to be quite challenging. Overall, evaluation can be decomposed into two separate components: locating a suitable resource to serve as ground truth and leveraging it to assess system responses.

It is not difficult to find disease-specific pharmacology resources. We employed Clinical Evidence (CE), a periodic report created by the British Medical Journal (BMJ) Publishing Group that summarizes the best known drugs for a few dozen diseases. Note that the existence of such secondary sources does not obviate the need for automated systems, because such reports perpetually fall out of date due to rapid advances in medicine. Furthermore, they are currently created by highly experienced physicians, which is an expensive and time-consuming process.

For each disease, CE classifies drugs into one of six categories: beneficial, likely beneficial, trade-offs (i.e., may have adverse side effects), unknown, unlikely beneficial, and harmful. Included with each entry is a list of references: the citations consulted by the editors in compiling the resource. Although the completeness of the drugs enumerated in CE is questionable, it can nevertheless be viewed as “authoritative”.
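
These six categories constitute the ground truth against which system responses are judged. As a concrete illustration only (CE does not ship in this form; the type names and toy entries below are our assumptions), such a resource can be modeled as a mapping from diseases to judged drug entries:

    from dataclasses import dataclass, field
    from enum import Enum

    class Verdict(Enum):
        # The six CE categories described above.
        BENEFICIAL = "beneficial"
        LIKELY_BENEFICIAL = "likely beneficial"
        TRADE_OFFS = "trade-offs"  # may have adverse side effects
        UNKNOWN = "unknown"
        UNLIKELY_BENEFICIAL = "unlikely beneficial"
        HARMFUL = "harmful"

    @dataclass
    class DrugEntry:
        name: str
        verdict: Verdict
        # Citations consulted by the CE editors in compiling the entry.
        references: list[str] = field(default_factory=list)

    # Toy ground truth for a single disease; all names are invented.
    ground_truth: dict[str, list[DrugEntry]] = {
        "disease X": [
            DrugEntry("drug A", Verdict.BENEFICIAL),
            DrugEntry("drug B", Verdict.TRADE_OFFS),
            DrugEntry("drug C", Verdict.HARMFUL),
        ],
    }

One natural way to leverage such a structure, per the decomposition above, is to examine how the drugs proposed in a system response distribute across these verdicts, rewarding answers dominated by beneficial entries.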