
5.1 Previous Work

How can we leverage a resource such as CE to assess the responses generated by our system? A survey of evaluation methodologies reveals shortcomings in existing techniques.

Answers to factoid questions are automatically

expandable lists, etc. While interface design is clearly important, it is not the focus of our work.

Clustering techniques have also been evaluated in the same manner as text classification algorithms, in terms of precision, recall, etc. based on some ground truth (Zhao and Karypis, 2002). This, however, assumes the existence of stable, invariant categories, which is not the case since our output clusters are query-specific. Although it may be possible to manually create “reference clusters”, we lack sufficient resources to develop such a data set. Furthermore, it is unclear if sufficient interannotator agreement can be obtained to support meaningful evaluation.

Ultimately, we devised two separate evaluations to assess the quality of our system output based on the techniques discussed above. The first is a manual evaluation focused on the cluster labels (i.e., drug categories), based on a factoid QA evaluation methodology. The second is an automatic evaluation of the retrieved abstracts using ROUGE, drawing elements from summarization evaluation. Details of the evaluation setup and results are preceded by a description of the test collection we created from CE.

ticles), and encode a substantial amount of knowledge about the contents of the citation. PubMed allows searches on MeSH terms, which usually yield accurate results. In addition, we limited retrieved citations to those that have the MeSH heading “drug therapy” and those that describe a clinical trial (another metadata field). Finally, we restricted the date range of the queries so that abstracts published after our version of CE were excluded. Although the query formulation process currently requires a human, we envision automating this step using a template-based approach in the future.

6 System Evaluation

We adapted existing techniques to evaluate our system in two separate ways: a factoid-style manual evaluation focused on short answers and an automatic evaluation with ROUGE using CE-cited abstracts as the reference summaries. The setup