
The assessor had no idea from which condition an answer came; this process guarded against assessor bias.

Each answer was evaluated in two different ways: first, with respect to the ground truth in CE, and second, using the assessor's own medical expertise. In the first set of judgments, the assessor determined which of the six categories (beneficial, likely beneficial, tradeoffs, unknown, unlikely beneficial, harmful) the system answer belonged to, based on the CE recommendations. As we have discussed previously, a human (with sufficient domain knowledge) is required to perform this matching due to synonymy and differences in ontological granularity. However, note that the assessor only considered the drug name when making this categorization. In the second set of judgments, the assessor separately determined if the short answer was "good", "okay" (marginal), or "bad" based both on CE and her own experience, taking into account the abstract title and the top-scoring outcome sentence (and, if necessary, the entire abstract text).

Results of this manual evaluation are presented in Table 2, which shows the distribution of judgments for the three experimental conditions. For baseline PubMed, 20% of the examined drugs fell in the beneficial category; the values are 39% for the Cluster condition and 40% for the Oracle condition. In terms of short answers, our system returns approximately twice as many beneficial drugs as the baseline, a marked increase in answer accuracy. Note that a large fraction of the drugs evaluated were not found in CE at all, which provides an estimate of its coverage. In terms of the assessor's own judgments, 60% of PubMed short answers were found to be "good", compared to 83% and 89% for the Cluster and Oracle conditions, respectively. From a factoid QA point of view, we can conclude that our system outperforms the PubMed baseline.

6.2 Automatic Evaluation of Abstracts

A major limitation of the factoid-based evaluation methodology is that it does not measure the quality of the abstracts from which the short answers were extracted. Since we lacked the necessary resources to manually gather abstract-level judgments for evaluation, we sought an alternative. Fortunately, CE can be leveraged to assess the "goodness" of abstracts automatically. We assume that references cited in CE are examples of high-quality abstracts, since they were used in generating the drug recommendations. Following standard assumptions made in summarization evaluation, we considered abstracts that are similar in content to these "reference abstracts" to also be "good" (i.e., relevant). Similarity in content can be quantified with ROUGE.

Since physicians demand high precision, we assess the cumulative relevance after the first, second, and third abstract that the clinician is likely to have examined (where the relevance for each individual abstract is given by its ROUGE-1 precision score). For the baseline PubMed condition, the examined abstracts simply correspond to the first three hits in the result set. For our test system, we developed three different orderings. The first, which we term cluster round-robin, selects the first abstract from the top three clusters (by size). The second, which we term oracle cluster order, selects three abstracts from the best cluster, assuming the existence of an oracle that informs the system. The third, which we term oracle round-robin, selects the first abstract from each of the three best clusters (also determined by an oracle).
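To make the evaluation procedure concrete, the sketch below shows one way the cumulative-relevance computation and the three orderings could be implemented. It is only an illustration of the method described above, not our actual evaluation code: the helper names are ours, whitespace tokenization stands in for whatever preprocessing ROUGE applies, and we assume the CE-cited reference abstracts are pooled into a single bag of unigrams.

```python
from collections import Counter

def rouge1_precision(candidate_tokens, reference_tokens):
    # Unigram precision: fraction of candidate tokens that also appear
    # in the reference, with counts clipped by the reference counts.
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum(min(count, ref[tok]) for tok, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def cumulative_relevance(ordered_abstracts, reference_abstracts, depth=3):
    # Running sum of per-abstract ROUGE-1 precision after the clinician
    # has examined the first, second, and third abstract.
    ref_tokens = [t for ref in reference_abstracts for t in ref.lower().split()]
    scores, total = [], 0.0
    for abstract in ordered_abstracts[:depth]:
        total += rouge1_precision(abstract.lower().split(), ref_tokens)
        scores.append(total)
    return scores  # [rank-1, rank-2, rank-3]

# The three orderings, over clusters sorted by size (largest first);
# the "best" clusters are supplied by an oracle in the oracle conditions.
def cluster_round_robin(clusters):
    return [cluster[0] for cluster in clusters[:3]]

def oracle_cluster_order(clusters, best_index):
    return clusters[best_index][:3]

def oracle_round_robin(clusters, best_indices):
    return [clusters[i][0] for i in best_indices[:3]]
```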

Results of this evaluation are shown in Table 3. The columns show the cumulative relevance (i.e., ROUGE score) after examining the first, second, and third abstract, under the different ordering conditions. To determine statistical significance, we applied the Wilcoxon signed-rank test, the standard non-parametric test for applications of this type. Due to the relatively small test set (only 25 questions), the increase in cumulative relevance exhibited by the cluster round-robin condition is not statistically significant. However, differences for the oracle conditions were significant.

                       Rank 1            Rank 2            Rank 3
PubMed Ranked List     0.170             0.349             0.523
Cluster Round-Robin    0.181 (+6.3%)     0.356 (+2.1%)     0.526 (+0.5%)
Oracle Cluster Order   0.206 (+21.5%)†   0.392 (+12.6%)†   0.597 (+14.0%)‡
Oracle Round-Robin     0.206 (+21.5%)†   0.396 (+13.6%)†   0.586 (+11.9%)‡

Table 3: Cumulative relevance after examining the first, second, and third abstracts, according to different orderings († denotes significance at the 0.90 level, ‡ at the 0.95 level; unmarked differences are not statistically significant).
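For readers who want to see how the significance markers in Table 3 could be computed, a minimal sketch using SciPy's implementation of the Wilcoxon signed-rank test follows. The per-question scores below are invented for illustration only; in our setting each condition contributes one cumulative-relevance value per question (25 questions in total), and the test is run on the paired differences against the PubMed baseline.

```python
from scipy.stats import wilcoxon

# Hypothetical per-question cumulative relevance values (the real test
# set contains 25 questions); each position pairs one question's score
# under the baseline with its score under a system condition.
baseline  = [0.12, 0.25, 0.08, 0.31, 0.19, 0.27, 0.15]  # PubMed ranked list
condition = [0.18, 0.24, 0.15, 0.40, 0.22, 0.33, 0.21]  # e.g., oracle cluster order

# Wilcoxon signed-rank test on the paired differences.
statistic, p_value = wilcoxon(condition, baseline)
print(f"W = {statistic:.1f}, p = {p_value:.4f}")
```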

7 Discussion and Related Work

According to two separate evaluations, it appears that our system outperforms the PubMed baseline. However, our approach provides additional advantages over a linear result set that are not highlighted in these evaluations. Although difficult to quantify, categorized results provide an overview of the information landscape that is difficult to acquire by simply browsing a ranked list—user studies of categorized search have affirmed its value (Hearst and Pedersen, 1996; Dumais et al., 2001). One main advantage we see in our application is better "redundancy management". With a ranked list, the physician may be forced to browse through multiple redundant abstracts that discuss the same or similar drugs to get a sense of the different treatment options. With our cluster-based approach, however, potentially redundant information is grouped together, since interventions discussed in a particular cluster are ontologically related through UMLS. The physician can examine different clusters for a broad overview, or peruse multiple abstracts within a cluster for a more thorough review of the evidence. Our cluster-based system is able to support both types of behaviors.

This work demonstrates the value of semantic resources in the question answering process, since our approach makes extensive use of the UMLS ontology in all phases of answer synthesis. The coverage of individual drugs, as well as the relationship between different types of drugs within UMLS, enables both answer extraction and semantic clustering. As detailed in (Demner-Fushman and Lin, 2006 in press), UMLS-based features are also critical in the identification of clinical outcomes, on which our extractive summaries are based. As a point of comparison, we also implemented a purely term-based approach to clustering PubMed citations. The results are so incoherent that a formal evaluation would prove to be meaningless. Semantic relations between drugs, as captured in UMLS, provide an effective method for organizing results—these relations cannot be captured by keyword content alone. Furthermore, term-based approaches suffer from the cluster labeling problem: it is difficult to automatically generate a short heading that describes cluster content.
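The paper does not spell out how the term-based comparison was implemented; a generic stand-in would be to cluster TF-IDF representations of the retrieved abstracts, as sketched below (the library choice, number of clusters, and preprocessing are all our assumptions). A baseline of this kind groups citations by surface vocabulary rather than by the drug relationships encoded in UMLS, which is consistent with the incoherent clusters we observed.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def term_based_clusters(abstracts, k=5):
    # Cluster retrieved abstracts purely on TF-IDF term overlap,
    # with no access to UMLS drug relationships.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    clusters = {}
    for label, abstract in zip(labels, abstracts):
        clusters.setdefault(label, []).append(abstract)
    # Return clusters ordered by size (largest first), mirroring the
    # size-based cluster ordering used elsewhere in the system.
    return sorted(clusters.values(), key=len, reverse=True)
```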

Nevertheless, there are a number of assumptions behind our work that are worth pointing out. First, we assume a high quality initial result set. Since the class of questions we examine translates naturally into accurate PubMed queries that can make full use of human-assigned MeSH terms, the overall quality of the initial citations can be assured. Related work in retrieval algorithms (Demner-Fushman and Lin, 2006 in press) shows that accurate relevance scoring of MEDLINE citations in response to more general clinical questions is possible.

Second, our system does not actually perform semantic processing to determine the efficacy of a drug: it only recognizes "topics" and outcome sentences that state clinical findings. Since the system by default orders the clusters based on size, it implicitly equates "most popular drug" with "best drug". Although this assumption is false, we have observed in practice that more-studied drugs are more likely to be beneficial.

In contrast with the genomics domain, which has received much attention from both the IR and NLP communities, retrieval systems for the clinical domain represent an underexplored area of research. Although individual components that attempt to operationalize principles of evidence-based medicine do exist (Mendonça and Cimino, 2001; Niu and Hirst, 2004), complete end-to-end clinical question answering systems are difficult to find. Within the context of the PERSIVAL project (McKeown et al., 2003), researchers at Columbia have developed a system that leverages patient records to rerank search results. Since the focus is on personalized summaries, this work can be viewed as complementary to our own.

8 Conclusion

The primary contribution of this work is the development of a clinical question answering system that caters to the unique requirements of physicians, who demand both conciseness and completeness. These competing factors can be balanced in a system's response by providing multiple levels of drill-down that allow the information space to be viewed at different levels of granularity. We have chosen to implement these capabilities through answer extraction, semantic clustering, and extractive summarization. Two separate evaluations demonstrate that our system outperforms the PubMed baseline, illustrating the effectiveness of a hybrid approach that leverages semantic resources.

9 Acknowledgments

This work was supported in part by the U.S. National Library of Medicine. The second author thanks Esther and Kiri for their loving support.

References

E. Amigó, J. Gonzalo, V. Peinado, A. Peñas, and F. Verdejo. 2004. An empirical study of information synthesis tasks. In ACL 2004.

D. Demner-Fushman and J. Lin. 2005. Knowledge extraction for clinical question answering: Preliminary results. In AAAI 2005 Workshop on QA in Restricted Domains.

D. Demner-Fushman and J. Lin. 2006, in press. Answering clinical questions with knowledge-based and statistical techniques. Comp. Ling.

S. Dumais, E. Cutrell, and H. Chen. 2001. Optimizing search by showing results in context. In CHI 2001.

J. Ely, J. Osheroff, M. Ebell, G. Bergus, B. Levy, M. Chambliss, and E. Evans. 1999. Analysis of questions asked by family doctors regarding patient care. BMJ, 319:358–361.

P. Gorman, J. Ash, and L. Wykoff. 1994. Can primary care physicians' questions be answered using the medical journal literature? Bulletin of the Medical Library Association, 82(2):140–146, April.

S. Hauser, D. Demner-Fushman, G. Ford, and G. Thoma. 2004. PubMed on Tap: Discovering design principles for online information delivery to handheld computers. In MEDINFO 2004.

M. Hearst and J. Pedersen. 1996. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In SIGIR 1996.

D. Lawrie and W. Croft. 2003. Generating hierarchical summaries for Web searches. In SIGIR 2003.

J. Lin. 2005. Evaluation of resources for question answering evaluation. In SIGIR 2005.

D. Lindberg, B. Humphreys, and A. McCray. 1993. The Unified Medical Language System. Methods of Information in Medicine, 32(4):281–291.

K. McKeown, N. Elhadad, and V. Hatzivassiloglou. 2003. Leveraging a common representation for personalized search and summarization in a medical digital library. In JCDL 2003.