marization, a proper evaluation that captures the salient characteristics of our system proved to be quite challenging. Overall, evaluation can be decomposed into two separate components: locating a suitable resource to serve as ground truth and leveraging it to assess system responses.

It is not difficult to find disease-specific pharmacology resources. We employed Clinical Evidence (CE), a periodic report created by the British Medical Journal (BMJ) Publishing Group that summarizes the best known drugs for a few dozen diseases. Note that the existence of such secondary sources does not obviate the need for automated systems because they are perpetually falling out of date due to rapid advances in medicine. Furthermore, such reports are currently created by highly experienced physicians, which is an expensive and time-consuming process.

For each disease, CE classifies drugs into one of six categories: beneficial, likely beneficial, trade-offs (i.e., may have adverse side effects), unknown, unlikely beneficial, and harmful. Included with each entry is a list of references, the citations consulted by the editors in compiling the resource. Although the completeness of the drugs enumerated in CE is questionable, it nevertheless can be viewed as “authoritative”.

2005) developed for scoring answers to complex questions is not suitable for our task, since there is no coherent notion of an “answer text” that the user reads end-to-end. Furthermore, it is unclear what exactly a “nugget” in this case would be. For similar reasons, methodologies for summarization evaluation are also of little help. Typically, system-generated summaries are either evaluated manually by humans (which is expensive and time-consuming) or automatically using a metric such as ROUGE, which compares system output against a number of reference summaries. The interactive nature of our answers violates the assumption that systems’ responses are static text segments. Furthermore, it is unclear what exactly should go into a reference summary, because physicians may want varying amounts of detail depending on familiarity with the disease and patient-specific factors.

Evaluation methodologies from information retrieval are also inappropriate. User studies have previously been employed to examine the effect of categorized search results. However, they often conflate the effectiveness of the interface with that of the underlying algorithms. For example, Dumais et al. (2001) found significant differences in task performance based on different ways of using purely presentational devices such as mouseovers,