the assessor had no idea from which condition an answer came, this process guarded against assessor bias.

Each answer was evaluated in two different ways: first, with respect to the ground truth in CE, and second, using the assessor's own medical expertise. In the first set of judgments, the assessor determined which of the six categories (beneficial, likely beneficial, tradeoffs, unknown, unlikely beneficial, harmful) the system answer belonged to, based on the CE recommendations. As we have discussed previously, a human (with sufficient domain knowledge) is required to perform this matching due to synonymy and differences in ontological granularity. However, note that the assessor only considered the drug name when making this categorization. In the second set of judgments, the assessor separately determined whether the short answer was "good", "okay" (marginal), or "bad", based both on CE and her own experience, taking into account the abstract title and the top-scoring outcome sentence (and, if necessary, the entire abstract text).

Results of this manual evaluation are presented in Table 2, which shows the distribution of judgments for the three experimental conditions. For baseline PubMed, 20% of the examined drugs fell into the beneficial category; the values are 39% for the Cluster condition and 40% for the Oracle condition. In terms of short answers, our system returns approximately twice as many beneficial drugs as the baseline, a marked increase in answer accuracy. Note that a large fraction of the drugs evaluated were not found in CE at all, which provides an estimate of its coverage. In terms of the assessor's own judgments, 60% of PubMed short answers were found to be "good", compared to 83% and 89% for the Cluster and Oracle conditions, respectively. From a factoid QA point of view, we can conclude that our system outperforms the PubMed baseline.

6.2 Automatic Evaluation of Abstracts

A major limitation of the factoid-based evaluation methodology is that it does not measure the quality of the abstracts from which the short answers were extracted. Since we lacked the necessary resources to manually gather abstract-level judgments for evaluation, we sought an alternative. Fortunately, CE can be leveraged to assess the "goodness" of abstracts automatically. We assume that references cited in CE are examples of high-quality abstracts, since they were used in generating the drug recommendations. Following standard assumptions made in summarization evaluation, we considered abstracts that are similar in content to these "reference abstracts" to also be "good" (i.e., relevant). Similarity in content can be quantified with ROUGE.
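To make this concrete, the following minimal Python sketch shows one way ROUGE-1 precision against a set of CE-cited reference abstracts could be computed; the tokenization, the clipping of unigram counts, and the toy strings are illustrative assumptions, and the actual evaluation presumably relied on the standard ROUGE toolkit.

    from collections import Counter

    def rouge1_precision(candidate, references):
        """Fraction of candidate unigrams that also appear in some reference
        abstract (counts clipped by the maximum count seen in any reference)."""
        cand = Counter(candidate.lower().split())
        ref = Counter()
        for r in references:
            for tok, n in Counter(r.lower().split()).items():
                ref[tok] = max(ref[tok], n)
        overlap = sum(min(n, ref[tok]) for tok, n in cand.items())
        return overlap / max(sum(cand.values()), 1)

    # Toy example with hypothetical abstract snippets.
    retrieved = "Beta blockers reduced mortality in patients with heart failure"
    cited_by_ce = ["Mortality was reduced by beta blockers in heart failure patients"]
    print(rouge1_precision(retrieved, cited_by_ce))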
Since physicians demand high precision, we assess the cumulative relevance after the first, second, and third abstract that the clinician is likely to have examined (where the relevance for each individual abstract is given by its ROUGE-1 precision score). For the baseline PubMed condition, the examined abstracts simply correspond to the first three hits in the result set. For our test system, we developed three different orderings. The first, which we term cluster round-robin, selects the first abstract from the top three clusters (by size). The second, which we term oracle cluster order, selects three abstracts from the best cluster, assuming the existence of an oracle that informs the system. The third, which we term oracle round-robin, selects the first abstract from each of the three best clusters (also determined by an oracle).
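A minimal sketch of the three orderings, under the assumption that each cluster is simply a list of abstracts pre-sorted by cluster size and that the oracle is reduced to caller-supplied indices, might look as follows; the function and variable names are hypothetical.

    def cluster_round_robin(clusters):
        """First abstract from each of the top three clusters (clusters
        assumed to be pre-sorted by size, largest first)."""
        return [c[0] for c in clusters[:3]]

    def oracle_cluster_order(clusters, best_idx):
        """Three abstracts from the single best cluster, identified by an
        oracle (here, simply an index supplied by the caller)."""
        return clusters[best_idx][:3]

    def oracle_round_robin(clusters, best_indices):
        """First abstract from each of the three best clusters, again as
        determined by an oracle."""
        return [clusters[i][0] for i in best_indices[:3]]

    # Hypothetical usage: each cluster is a list of PubMed abstract IDs.
    clusters = [["pmid1", "pmid2"], ["pmid3"], ["pmid4", "pmid5"], ["pmid6"]]
    print(cluster_round_robin(clusters))            # ['pmid1', 'pmid3', 'pmid4']
    print(oracle_cluster_order(clusters, 2))        # ['pmid4', 'pmid5']
    print(oracle_round_robin(clusters, [2, 0, 3]))  # ['pmid4', 'pmid1', 'pmid6']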
Results of this evaluation are shown in Table 3. The columns show the cumulative relevance (i.e., ROUGE score) after examining the first, second, and third abstract, under the different ordering conditions. To determine statistical significance, we applied the Wilcoxon signed-rank test, the standard non-parametric test for applications of this type. Due to the relatively small test set (only 25 questions), the increase in cumulative relevance exhibited by the cluster round-robin condition is not statistically significant. However, differences for the oracle conditions were significant.
                        Rank 1             Rank 2             Rank 3
PubMed Ranked List      0.170              0.349              0.523
Cluster Round-Robin     0.181 (+6.3%) ◦    0.356 (+2.1%) ◦    0.526 (+0.5%) ◦
Oracle Cluster Order    0.206 (+21.5%) △   0.392 (+12.6%) △   0.597 (+14.0%) ▲
Oracle Round-Robin      0.206 (+21.5%) △   0.396 (+13.6%) △   0.586 (+11.9%) ▲

Table 3: Cumulative relevance after examining the first, second, and third abstracts, according to different orderings. (◦ denotes n.s., △ denotes sig. at 0.90, ▲ denotes sig. at 0.95)
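As an illustration of how figures like those in Table 3 and their significance levels can be computed, the sketch below assumes that cumulative relevance is the per-question sum of abstract-level ROUGE-1 precision scores averaged over the 25 questions, uses scipy's wilcoxon as a stand-in for whatever signed-rank implementation was actually used, and fills in placeholder scores.

    import numpy as np
    from scipy.stats import wilcoxon

    def cumulative_relevance(per_question_scores, rank):
        """Mean over questions of the summed ROUGE-1 precision of the
        first `rank` abstracts examined (one assumed definition)."""
        return np.mean([sum(scores[:rank]) for scores in per_question_scores])

    # Placeholder scores: three abstract-level ROUGE-1 precisions per
    # question, for the baseline and one test condition.
    baseline = np.random.rand(25, 3) * 0.3
    cluster = baseline + 0.02

    for rank in (1, 2, 3):
        b = [sum(q[:rank]) for q in baseline]
        c = [sum(q[:rank]) for q in cluster]
        stat, p = wilcoxon(c, b)  # paired, non-parametric signed-rank test
        print(rank, cumulative_relevance(cluster, rank), p)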
7 Discussion and Related Work

According to two separate evaluations, it appears that our system outperforms the PubMed baseline. However, our approach provides additional advantages over a linear result set that are not highlighted in these evaluations. Although difficult to quantify, categorized results provide an overview of the information landscape that is difficult to acquire by simply browsing a ranked list; user studies of categorized search have affirmed its value (Hearst and Pedersen, 1996; Dumais et al., 2001). One main advantage we see in our application is better "redundancy management". With a ranked list, the physician may be forced to browse through multiple redundant abstracts that discuss the same or similar drugs to get a sense of the different treatment options. With our cluster-based approach, however, potentially redundant information is grouped together, since interventions discussed in a particular cluster are ontologically related through UMLS. The physician can examine different clusters for a broad overview, or peruse multiple abstracts within a cluster for a more thorough review of the evidence. Our cluster-based system is able to support both types of behaviors.

This work demonstrates the value of semantic resources in the question answering process, since our approach makes extensive use of the UMLS ontology in all phases of answer synthesis. The coverage of individual drugs, as well as the relationship between different types of drugs within UMLS, enables both answer extraction and semantic clustering. As detailed in Demner-Fushman and Lin (2006, in press), UMLS-based features are also critical in the identification of clinical outcomes, on which our extractive summaries are based. As a point of comparison, we also implemented a purely term-based approach to clustering PubMed citations. The results are so incoherent that a formal evaluation would prove to be meaningless. Semantic relations between drugs, as captured in UMLS, provide an effective method for organizing results; these relations cannot be captured by keyword content alone. Furthermore, term-based approaches suffer from the cluster labeling problem: it is difficult to automatically generate a short heading that describes cluster content.
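For reference, a purely term-based baseline of the kind described above might be sketched as follows; the choice of scikit-learn's TF-IDF vectorizer and k-means, the number of clusters, and the toy abstracts are assumptions for illustration, not a description of the baseline we actually implemented.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy abstracts; in the real setting these would be the retrieved
    # PubMed citations for one question.
    abstracts = [
        "atenolol lowered blood pressure in hypertensive patients",
        "metoprolol reduced cardiovascular mortality",
        "hydrochlorothiazide was effective as first-line therapy",
        "chlorthalidone reduced stroke incidence in older adults",
    ]

    vectors = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    for label, text in zip(labels, abstracts):
        print(label, text)
    # Surface term overlap drives the grouping, so the two beta blockers and
    # the two diuretics need not end up together, and no meaningful cluster
    # label falls out of the process.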
Nevertheless, there are a number of assumptions behind our work that are worth pointing out. First, we assume a high-quality initial result set. Since the class of questions we examine translates naturally into accurate PubMed queries that can make full use of human-assigned MeSH terms, the overall quality of the initial citations can be assured. Related work in retrieval algorithms (Demner-Fushman and Lin, 2006, in press) shows that accurate relevance scoring of MEDLINE citations in response to more general clinical questions is possible.

Second, our system does not actually perform semantic processing to determine the efficacy of a drug: it only recognizes "topics" and outcome sentences that state clinical findings. Since the system by default orders the clusters based on size, it implicitly equates "most popular drug" with "best drug". Although this assumption is false, we have observed in practice that more-studied drugs are more likely to be beneficial.

In contrast with the genomics domain, which has received much attention from both the IR and NLP communities, retrieval systems for the clinical domain represent an underexplored area of research. Although individual components that attempt to operationalize principles of evidence-based medicine do exist (Mendonça and Cimino, 2001; Niu and Hirst, 2004), complete end-to-end clinical question answering systems are difficult to find.
Within the context of the PERSIVAL project (McKeown et al., 2003), researchers at Columbia have developed a system that leverages patient records to rerank search results. Since the focus is on personalized summaries, this work can be viewed as complementary to our own.

8 Conclusion

The primary contribution of this work is the development of a clinical question answering system that caters to the unique requirements of physicians, who demand both conciseness and completeness. These competing factors can be balanced in a system's response by providing multiple levels of drill-down that allow the information space to be viewed at different levels of granularity. We have chosen to implement these capabilities through answer extraction, semantic clustering, and extractive summarization. Two separate evaluations demonstrate that our system outperforms the PubMed baseline, illustrating the effectiveness of a hybrid approach that leverages semantic resources.

9 Acknowledgments

This work was supported in part by the U.S. National Library of Medicine. The second author thanks Esther and Kiri for their loving support.

References

D. Demner-Fushman and J. Lin. 2005. Knowledge extraction for clinical question answering: Preliminary results. In AAAI 2005 Workshop on QA in Restricted Domains.

D. Demner-Fushman and J. Lin. 2006, in press. Answering clinical questions with knowledge-based and statistical techniques. Computational Linguistics.

S. Dumais, E. Cutrell, and H. Chen. 2001. Optimizing search by showing results in context. In CHI 2001.

J. Ely, J. Osheroff, M. Ebell, G. Bergus, B. Levy, M. Chambliss, and E. Evans. 1999. Analysis of questions asked by family doctors regarding patient care. BMJ, 319:358–361.

P. Gorman, J. Ash, and L. Wykoff. 1994. Can primary care physicians' questions be answered using the medical journal literature? Bulletin of the Medical Library Association, 82(2):140–146, April.

S. Hauser, D. Demner-Fushman, G. Ford, and G. Thoma. 2004. PubMed on Tap: Discovering design principles for online information delivery to handheld computers. In MEDINFO 2004.

M. Hearst and J. Pedersen. 1996. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In SIGIR 1996.

D. Lawrie and W. Croft. 2003. Generating hierarchical summaries for Web searches. In SIGIR 2003.

J. Lin. 2005. Evaluation of resources for question answering evaluation. In SIGIR 2005.

D. Lindberg, B. Humphreys, and A. McCray. 1993. The Unified Medical Language System. Methods of Information in Medicine, 32(4):281–291.

E. Amigó, J. Gonzalo, V. Peinado, A. Peñas, and F. Verdejo. 2004. An empirical study of informa-

K. McKeown, N. Elhadad, and V. Hatzivassiloglou.