the assessor had no idea from which condition an answer came, this process guarded against assessor bias.

Each answer was evaluated in two different ways: first, with respect to the ground truth in CE, and second, using the assessor's own medical expertise. In the first set of judgments, the assessor determined which of the six categories (beneficial, likely beneficial, tradeoffs, unknown, unlikely beneficial, harmful) the system answer belonged to, based on the CE recommendations. As we have discussed previously, a human (with sufficient domain knowledge) is required to perform this matching due to synonymy and differences in ontological granularity. However, note that the assessor only considered the drug name when making this categorization. In the second set of judgments, the assessor separately determined whether the short answer was "good", "okay" (marginal), or "bad", based both on CE and her own experience, taking into account the abstract title and the top-scoring outcome sentence (and, if necessary, the entire abstract text).

Results of this manual evaluation are presented in Table 2, which shows the distribution of judgments for the three experimental conditions. For baseline PubMed, 20% of the examined drugs fell into the beneficial category; the values are 39% for the Cluster condition and 40% for the Oracle condition. In terms of short answers, our system returns approximately twice as many beneficial drugs as the baseline, a marked increase in answer accuracy. Note that a large fraction of the drugs evaluated were not found in CE at all, which provides an estimate of its coverage. In terms of the assessor's own judgments, 60% of PubMed short answers were found to be "good", compared to 83% and 89% for the Cluster and Oracle conditions, respectively. From a factoid QA point of view, we can conclude that our system outperforms the PubMed baseline.

6.2 Automatic Evaluation of Abstracts

A major limitation of the factoid-based evaluation methodology is that it does not measure the quality of the abstracts from which the short answers were extracted. Since we lacked the necessary resources to manually gather abstract-level judgments for evaluation, we sought an alternative. Fortunately, CE can be leveraged to assess the "goodness" of abstracts automatically. We assume that references cited in CE are examples of high-quality abstracts, since they were used in generating the drug recommendations. Following standard assumptions made in summarization evaluation, we considered abstracts that are similar in content to these "reference abstracts" to also be "good" (i.e., relevant). Similarity in content can be quantified with ROUGE.
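To make this concrete, the following minimal Python sketch shows one way ROUGE-1 precision against a set of CE-cited reference abstracts could be computed; the tokenization, the clipping of unigram counts, and the toy strings are illustrative assumptions, and the actual evaluation presumably relied on the standard ROUGE toolkit.

    from collections import Counter

    def rouge1_precision(candidate, references):
        """Fraction of candidate unigrams that also appear in some reference
        abstract (counts clipped by the maximum count seen in any reference)."""
        cand = Counter(candidate.lower().split())
        ref = Counter()
        for r in references:
            for tok, n in Counter(r.lower().split()).items():
                ref[tok] = max(ref[tok], n)
        overlap = sum(min(n, ref[tok]) for tok, n in cand.items())
        return overlap / max(sum(cand.values()), 1)

    # Toy example with hypothetical abstract snippets.
    retrieved = "Beta blockers reduced mortality in patients with heart failure"
    cited_by_ce = ["Mortality was reduced by beta blockers in heart failure patients"]
    print(rouge1_precision(retrieved, cited_by_ce))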
Since physicians demand high precision, we assess the cumulative relevance after the first, second, and third abstract that the clinician is likely to have examined (where the relevance for each individual abstract is given by its ROUGE-1 precision score). For the baseline PubMed condition, the examined abstracts simply correspond to the first three hits in the result set. For our test system, we developed three different orderings. The first, which we term cluster round-robin, selects the first abstract from the top three clusters (by size). The second, which we term oracle cluster order, selects three abstracts from the best cluster, assuming the existence of an oracle that informs the system. The third, which we term oracle round-robin, selects the first abstract from each of the three best clusters (also determined by an oracle).
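A minimal sketch of the three orderings, under the assumption that each cluster is simply a list of abstracts pre-sorted by cluster size and that the oracle is reduced to caller-supplied indices, might look as follows; the function and variable names are hypothetical.

    def cluster_round_robin(clusters):
        """First abstract from each of the top three clusters (clusters
        assumed to be pre-sorted by size, largest first)."""
        return [c[0] for c in clusters[:3]]

    def oracle_cluster_order(clusters, best_idx):
        """Three abstracts from the single best cluster, identified by an
        oracle (here, simply an index supplied by the caller)."""
        return clusters[best_idx][:3]

    def oracle_round_robin(clusters, best_indices):
        """First abstract from each of the three best clusters, again as
        determined by an oracle."""
        return [clusters[i][0] for i in best_indices[:3]]

    # Hypothetical usage: each cluster is a list of PubMed abstract IDs.
    clusters = [["pmid1", "pmid2"], ["pmid3"], ["pmid4", "pmid5"], ["pmid6"]]
    print(cluster_round_robin(clusters))            # ['pmid1', 'pmid3', 'pmid4']
    print(oracle_cluster_order(clusters, 2))        # ['pmid4', 'pmid5']
    print(oracle_round_robin(clusters, [2, 0, 3]))  # ['pmid4', 'pmid1', 'pmid6']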
Results of this evaluation are shown in Table 3. The columns show the cumulative relevance (i.e., ROUGE score) after examining the first, second, and third abstract, under the different ordering conditions. To determine statistical significance, we applied the Wilcoxon signed-rank test, the standard non-parametric test for applications of this type. Due to the relatively small test set (only 25 questions), the increase in cumulative relevance exhibited by the cluster round-robin condition is not statistically significant. However, differences for the oracle conditions were significant.
                        Rank 1             Rank 2             Rank 3
PubMed Ranked List      0.170              0.349              0.523
Cluster Round-Robin     0.181 (+6.3%) ◦    0.356 (+2.1%) ◦    0.526 (+0.5%) ◦
Oracle Cluster Order    0.206 (+21.5%) △   0.392 (+12.6%) △   0.597 (+14.0%) ▲
Oracle Round-Robin      0.206 (+21.5%) △   0.396 (+13.6%) △   0.586 (+11.9%) ▲

Table 3: Cumulative relevance after examining the first, second, and third abstracts, according to different orderings. (◦ denotes n.s., △ denotes sig. at 0.90, ▲ denotes sig. at 0.95)
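As an illustration of how figures like those in Table 3 and their significance levels can be computed, the sketch below assumes that cumulative relevance is the per-question sum of abstract-level ROUGE-1 precision scores averaged over the 25 questions, uses scipy's wilcoxon as a stand-in for whatever signed-rank implementation was actually used, and fills in placeholder scores.

    import numpy as np
    from scipy.stats import wilcoxon

    def cumulative_relevance(per_question_scores, rank):
        """Mean over questions of the summed ROUGE-1 precision of the
        first `rank` abstracts examined (one assumed definition)."""
        return np.mean([sum(scores[:rank]) for scores in per_question_scores])

    # Placeholder scores: three abstract-level ROUGE-1 precisions per
    # question, for the baseline and one test condition.
    baseline = np.random.rand(25, 3) * 0.3
    cluster = baseline + 0.02

    for rank in (1, 2, 3):
        b = [sum(q[:rank]) for q in baseline]
        c = [sum(q[:rank]) for q in cluster]
        stat, p = wilcoxon(c, b)  # paired, non-parametric signed-rank test
        print(rank, cumulative_relevance(cluster, rank), p)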
7 Discussion and Related Work

According to two separate evaluations, it appears that our system outperforms the PubMed baseline. However, our approach provides additional advantages over a linear result set that are not highlighted in these evaluations. Although difficult to quantify, categorized results provide an overview of the information landscape that is difficult to acquire by simply browsing a ranked list; user studies of categorized search have affirmed its value (Hearst and Pedersen, 1996; Dumais et al., 2001). One main advantage we see in our application is better "redundancy management". With a ranked list, the physician may be forced to browse through multiple redundant abstracts that discuss the same or similar drugs to get a sense of the different treatment options. With our cluster-based approach, however, potentially redundant information is grouped together, since interventions discussed in a particular cluster are ontologically related through UMLS. The physician can examine different clusters for a broad overview, or peruse multiple abstracts within a cluster for a more thorough review of the evidence. Our cluster-based system is able to support both types of behaviors.

This work demonstrates the value of semantic resources in the question answering process, since our approach makes extensive use of the UMLS ontology in all phases of answer synthesis. The coverage of individual drugs, as well as the relationship between different types of drugs within UMLS, enables both answer extraction and semantic clustering. As detailed in Demner-Fushman and Lin (2006, in press), UMLS-based features are also critical in the identification of clinical outcomes, on which our extractive summaries are based. As a point of comparison, we also implemented a purely term-based approach to clustering PubMed citations. The results are so incoherent that a formal evaluation would prove to be meaningless. Semantic relations between drugs, as captured in UMLS, provide an effective method for organizing results; these relations cannot be captured by keyword content alone. Furthermore, term-based approaches suffer from the cluster labeling problem: it is difficult to automatically generate a short heading that describes cluster content.
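For reference, a purely term-based baseline of the kind described above might be sketched as follows; the choice of scikit-learn's TF-IDF vectorizer and k-means, the number of clusters, and the toy abstracts are assumptions for illustration, not a description of the baseline we actually implemented.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy abstracts; in the real setting these would be the retrieved
    # PubMed citations for one question.
    abstracts = [
        "atenolol lowered blood pressure in hypertensive patients",
        "metoprolol reduced cardiovascular mortality",
        "hydrochlorothiazide was effective as first-line therapy",
        "chlorthalidone reduced stroke incidence in older adults",
    ]

    vectors = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    for label, text in zip(labels, abstracts):
        print(label, text)
    # Surface term overlap drives the grouping, so the two beta blockers and
    # the two diuretics need not end up together, and no meaningful cluster
    # label falls out of the process.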
Nevertheless, there are a number of assumptions behind our work that are worth pointing out. First, we assume a high-quality initial result set. Since the class of questions we examine translates naturally into accurate PubMed queries that can make full use of human-assigned MeSH terms, the overall quality of the initial citations can be assured. Related work in retrieval algorithms (Demner-Fushman and Lin, 2006, in press) shows that accurate relevance scoring of MEDLINE citations in response to more general clinical questions is possible.

Second, our system does not actually perform semantic processing to determine the efficacy of a drug: it only recognizes "topics" and outcome sentences that state clinical findings. Since the system by default orders the clusters based on size, it implicitly equates "most popular drug" with "best drug". Although this assumption is false, we have observed in practice that more-studied drugs are more likely to be beneficial.

In contrast with the genomics domain, which has received much attention from both the IR and NLP communities, retrieval systems for the clinical domain represent an underexplored area of research. Although individual components that attempt to operationalize principles of evidence-based medicine do exist (Mendonça and Cimino, 2001; Niu and Hirst, 2004), complete end-to-end clinical question answering systems are difficult to find.
Within the context of the PERSIVAL project (McKeown et al., 2003), researchers at Columbia have developed a system that leverages patient records to rerank search results. Since the focus is on personalized summaries, this work can be viewed as complementary to our own.

8 Conclusion

The primary contribution of this work is the development of a clinical question answering system that caters to the unique requirements of physicians, who demand both conciseness and completeness. These competing factors can be balanced in a system's response by providing multiple levels of drill-down that allow the information space to be viewed at different levels of granularity. We have chosen to implement these capabilities through answer extraction, semantic clustering, and extractive summarization. Two separate evaluations demonstrate that our system outperforms the PubMed baseline, illustrating the effectiveness of a hybrid approach that leverages semantic resources.

9 Acknowledgments

This work was supported in part by the U.S. National Library of Medicine. The second author thanks Esther and Kiri for their loving support.

References

D. Demner-Fushman and J. Lin. 2005. Knowledge extraction for clinical question answering: Preliminary results. In AAAI 2005 Workshop on QA in Restricted Domains.

D. Demner-Fushman and J. Lin. 2006, in press. Answering clinical questions with knowledge-based and statistical techniques. Computational Linguistics.

S. Dumais, E. Cutrell, and H. Chen. 2001. Optimizing search by showing results in context. In CHI 2001.

J. Ely, J. Osheroff, M. Ebell, G. Bergus, B. Levy, M. Chambliss, and E. Evans. 1999. Analysis of questions asked by family doctors regarding patient care. BMJ, 319:358–361.

P. Gorman, J. Ash, and L. Wykoff. 1994. Can primary care physicians' questions be answered using the medical journal literature? Bulletin of the Medical Library Association, 82(2):140–146, April.

S. Hauser, D. Demner-Fushman, G. Ford, and G. Thoma. 2004. PubMed on Tap: Discovering design principles for online information delivery to handheld computers. In MEDINFO 2004.

M. Hearst and J. Pedersen. 1996. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In SIGIR 1996.

D. Lawrie and W. Croft. 2003. Generating hierarchical summaries for Web searches. In SIGIR 2003.

J. Lin. 2005. Evaluation of resources for question answering evaluation. In SIGIR 2005.

D. Lindberg, B. Humphreys, and A. McCray. 1993. The Unified Medical Language System. Methods of Information in Medicine, 32(4):281–291.

E. Amigó, J. Gonzalo, V. Peinado, A. Peñas, and F. Verdejo. 2004. An empirical study of informa-

K. McKeown, N. Elhadad, and V. Hatzivassiloglou.