[...] 2005) developed for scoring answers to complex questions is not suitable for our task, since there is no coherent notion of an “answer text” that the user reads end-to-end. Furthermore, it is unclear what exactly a “nugget” in this case would be. For similar reasons, methodologies for summarization evaluation are also of little help. Typically, system-generated summaries are either evaluated manually by humans (which is expensive and time-consuming) or automatically using a metric such as ROUGE, which compares system output against a number of reference summaries. The interactive nature of our answers violates the assumption that systems’ responses are static text segments. Furthermore, it is unclear what exactly should go into a reference summary, because physicians may want varying amounts of detail depending on familiarity with the disease and patient-specific factors.
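
Since ROUGE figures in the discussion above, the following sketch (our own illustration, not part of the original evaluation) shows the n-gram recall computation at the core of ROUGE-N; the official toolkit adds stemming, stopword handling, and aggregation over multiple reference summaries, and the function name and simplifications here are ours.

    from collections import Counter

    def rouge_n_recall(system: str, reference: str, n: int = 1) -> float:
        # Simplified, single-reference ROUGE-N: the fraction of the
        # reference summary's n-grams that also appear in the system output.
        def ngrams(text: str) -> Counter:
            toks = text.lower().split()
            return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        ref, sys_out = ngrams(reference), ngrams(system)
        if not ref:
            return 0.0
        # Clip matches so a repeated system n-gram is not counted more
        # often than it occurs in the reference.
        overlap = sum(min(count, sys_out[gram]) for gram, count in ref.items())
        return overlap / sum(ref.values())

A static overlap metric of this kind presupposes exactly what our interactive responses lack: a fixed answer text and agreed-upon reference summaries.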

Evaluation methodologies from information retrieval are also inappropriate. User studies have previously been employed to examine the effect of categorized search results. However, they often conflate the effectiveness of the interface with that of the underlying algorithms. For example, Dumais et al. (2001) found significant differences in task performance based on different ways of using purely presentational devices such as mouseovers, [...]

[...]summarization, a proper evaluation that captures the salient characteristics of our system proved to be quite challenging. Overall, evaluation can be decomposed into two separate components: locating a suitable resource to serve as ground truth and leveraging it to assess system responses.

It is not difficult to find disease-specific pharmacology resources. We employed Clinical Evidence (CE), a periodic report created by the British Medical Journal (BMJ) Publishing Group that summarizes the best known drugs for a few dozen diseases. Note that the existence of such secondary sources does not obviate the need for automated systems, because such reports perpetually fall out of date due to rapid advances in medicine. Furthermore, they are currently created by highly experienced physicians, which is an expensive and time-consuming process.

For each disease, CE classifies drugs into one of six categories: beneficial, likely beneficial, trade-offs (i.e., may have adverse side effects), unknown, unlikely beneficial, and harmful. Included with each entry is a list of references: the citations consulted by the editors in compiling the resource. Although the completeness of the drugs enumerated in CE is questionable, it can nevertheless be viewed as “authoritative”.
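
These six categories constitute the ground truth against which system responses are judged. As a concrete illustration only (CE does not ship in this form; the type names and toy entries below are our assumptions), such a resource can be modeled as a mapping from diseases to judged drug entries:

    from dataclasses import dataclass, field
    from enum import Enum

    class Verdict(Enum):
        # The six CE categories described above.
        BENEFICIAL = "beneficial"
        LIKELY_BENEFICIAL = "likely beneficial"
        TRADE_OFFS = "trade-offs"  # may have adverse side effects
        UNKNOWN = "unknown"
        UNLIKELY_BENEFICIAL = "unlikely beneficial"
        HARMFUL = "harmful"

    @dataclass
    class DrugEntry:
        name: str
        verdict: Verdict
        # Citations consulted by the CE editors in compiling the entry.
        references: list[str] = field(default_factory=list)

    # Toy ground truth for a single disease; all names are invented.
    ground_truth: dict[str, list[DrugEntry]] = {
        "disease X": [
            DrugEntry("drug A", Verdict.BENEFICIAL),
            DrugEntry("drug B", Verdict.TRADE_OFFS),
            DrugEntry("drug C", Verdict.HARMFUL),
        ],
    }

One natural way to leverage such a structure, per the decomposition above, is to examine how the drugs proposed in a system response distribute across these verdicts, rewarding answers dominated by beneficial entries.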