We used Clinical Evidence to create a test collection for system evaluation. We randomly selected thirty diseases, generating a development set of five questions and a test set of twenty-five questions. Some examples include: acute asthma, chronic prostatitis, community acquired pneumonia, and erectile dysfunction. CE listed an average of 11.3 interventions per disease; of those, 2.3 on average were marked as beneficial and 1.9 as likely beneficial. On average, there were 48.4 references associated with each disease, representing the articles consulted during the compilation of CE itself. Of those, 34.7 citations on average appeared in MEDLINE; we gathered all these abstracts, which serve as the reference summaries for our ROUGE-based automatic evaluation.
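The automatic evaluation relies on the standard ROUGE toolkit. As a rough illustration of what a recall-oriented n-gram comparison against multiple reference abstracts involves, the following is a minimal sketch, assuming simple whitespace tokenization and plain averaging over references; the official toolkit is more elaborate (stemming, jackknifing, several n-gram and subsequence variants).

    from collections import Counter

    def ngram_counts(tokens, n):
        """Multiset of n-grams in a token sequence."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(candidate, reference, n=1):
        """Clipped n-gram overlap divided by the reference n-gram count."""
        cand = ngram_counts(candidate.lower().split(), n)
        ref = ngram_counts(reference.lower().split(), n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(ref.values())
        return overlap / total if total else 0.0

    def multi_reference_rouge(candidate, references, n=1):
        """Average the single-reference score over all reference abstracts."""
        return sum(rouge_n_recall(candidate, r, n) for r in references) / len(references)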
Since the focus of our work is not on retrieval algorithms per se, we employed PubMed to fetch an initial set of MEDLINE citations and performed answer synthesis using those results. The PubMed citations also serve as a baseline, since PubMed represents a system commonly used by physicians.
In order to obtain the best possible set of citations, the first author (an experienced PubMed searcher) manually formulated queries, taking advantage of MeSH (Medical Subject Headings) terms when available. MeSH terms are controlled vocabulary concepts assigned manually by trained medical indexers (based on the full text of the articles).
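Although the queries in this study were formulated manually through the PubMed interface, the same kind of MeSH-qualified search can be issued programmatically through NCBI's E-utilities. The sketch below is illustrative only; the example query string is hypothetical, not one of the queries actually used.

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

    def pubmed_search(query, retmax=50):
        """Return PubMed IDs (PMIDs) matching a query via NCBI ESearch."""
        params = urlencode({
            "db": "pubmed",      # search MEDLINE/PubMed
            "term": query,
            "retmax": retmax,    # cap on the number of PMIDs returned
            "retmode": "json",
        })
        with urlopen(f"{ESEARCH_URL}?{params}") as response:
            result = json.load(response)
        return result["esearchresult"]["idlist"]

    # Hypothetical example: constrain the disease to its MeSH heading and
    # restrict results to randomized controlled trials.
    pmids = pubmed_search('"Prostatitis"[mh] AND randomized controlled trial[pt]')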
6.1 Manual Evaluation of Short Answers

In our manual evaluation, system outputs were assessed as if they were answers to factoid questions. We gathered three different sets of answers. For the baseline, we used the main intervention from each of the first three PubMed citations. For our test condition, we considered the three largest clusters, taking the main intervention from the first abstract in each cluster. This yields three drugs that are at the same level of ontological granularity as those extracted from the unclustered PubMed citations. For our third condition, we assumed the existence of an oracle which selects the three best clusters (as determined by the first author, a medical doctor). From each of these three clusters, we extracted the main intervention of the first abstract. This oracle condition represents an achievable upper bound with a human in the loop. Physicians are highly-trained professionals who already have significant domain knowledge. Faced with a small number of choices, it is likely that they will be able to select the most promising cluster, even if they did not previously know of it.
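One way to make the three conditions concrete is sketched below. The Citation record and the representation of clusters as ranked lists of citations are assumptions for illustration; the paper does not describe its internal data structures.

    from dataclasses import dataclass

    @dataclass
    class Citation:
        pmid: str
        main_intervention: str  # drug name produced by the answer extractor

    def pubmed_answers(ranked_citations):
        """Baseline: main intervention of the first three PubMed citations."""
        return [c.main_intervention for c in ranked_citations[:3]]

    def cluster_answers(clusters):
        """Test condition: the first abstract's intervention from each of
        the three largest clusters."""
        largest = sorted(clusters, key=len, reverse=True)[:3]
        return [cluster[0].main_intervention for cluster in largest]

    def oracle_answers(clusters, best_indices):
        """Oracle condition: a physician picks the three best clusters."""
        return [clusters[i][0].main_intervention for i in best_indices[:3]]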
This preparation yielded up to nine drug names, three from each experimental condition. For short, we refer to these as PubMed, Cluster, and Oracle, respectively. After blinding the source of the drugs and removing duplicates, each short answer was presented to the first author for evaluation. Since

            |             Clinical Evidence             |     Physician
            |   B     LB     T     U     UB    H     N  | Good   Okay   Bad
    PubMed  | 0.200 0.213 0.160 0.053 0.000 0.013 0.360 | 0.600  0.227  0.173
    Cluster | 0.387 0.173 0.173 0.027 0.000 0.000 0.240 | 0.827  0.133  0.040
    Oracle  | 0.400 0.200 0.133 0.093 0.013 0.000 0.160 | 0.893  0.093  0.013

Table 2: Manual evaluation of short answers: distribution of system answers with respect to CE categories (left side) and with respect to the assessor's own expertise (right side). (Key: B=beneficial, LB=likely beneficial, T=tradeoffs, U=unknown, UB=unlikely beneficial, H=harmful, N=not in CE)