
tire article (if available). In the example shown in Figure 1, the physician can see that two classes of drugs (anti-microbial and alpha-adrenergic blocking agent) are relevant for the disease “chronic prostatitis”. Drilling down into the first cluster, the physician can see summarized evidence for two specific types of anti-microbials (temafloxacin and ofloxacin) extracted from MEDLINE abstracts.

Three major capabilities are required to produce the “answers” described above. First, the system must accurately identify the drugs under study in an abstract. Second, the system must group abstracts based on these substances in a meaningful way. Third, the system must generate short summaries of the clinical findings. We describe a clinical question answering system that implements exactly these capabilities (answer extraction, semantic clustering, and extractive summarization).

4 System Implementation

Our work is primarily concerned with synthesizing coherent answers from a set of search results; the actual source of these results is not important. For convenience, we employ MEDLINE citations retrieved by the PubMed search engine (which also serves as a baseline for comparison). Given an initial set of citations, answer generation proceeds in three phases, described below.

4.2 Semantic Clustering

Retrieved MEDLINE citations are organized into semantic clusters based on the main interventions identified in the abstract text. We employed a variant of the hierarchical agglomerative clustering algorithm (Zhao and Karypis, 2002) that utilizes semantic relationships within UMLS to compute similarities between interventions.

Iteratively, we group abstracts whose interventions fall under a common ancestor, i.e., a hypernym. The more generic ancestor concept (i.e., the class of drugs) is then used as the cluster label. The process repeats until no new clusters can be formed. In order to preserve granularity at the level of practical clinical interest, the tops of the UMLS hierarchy were truncated; for example, the MeSH category “Chemical and Drugs” is too general to be useful. This process was manually performed during system development. We decided to allow an abstract to appear in multiple clusters if more than one intervention was identified, e.g., if the abstract compared the efficacy of two treatments. Once the clusters have been formed, all citations are then sorted in the order of the original PubMed results, with the most abstract UMLS concept as the cluster label. Clusters themselves are sorted in decreasing size under the assumption that more clinical research is devoted to more pertinent types of drugs.
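
The grouping and ordering steps described in this section can be illustrated with a minimal sketch. It is not the Zhao and Karypis (2002) variant itself: it computes no similarity scores and simply merges clusters whose labels share a parent concept. The `PARENT` map is a hypothetical toy stand-in for UMLS (terazosin and alfuzosin are added purely for illustration), and the abstract identifiers and the function name `cluster_by_hypernym` are likewise assumptions.

```python
# Hypothetical toy hierarchy (child -> parent); a real system would consult UMLS.
PARENT = {
    "temafloxacin": "anti-microbial",
    "ofloxacin": "anti-microbial",
    "terazosin": "alpha-adrenergic blocking agent",
    "alfuzosin": "alpha-adrenergic blocking agent",
    "anti-microbial": "Chemical and Drugs",
    "alpha-adrenergic blocking agent": "Chemical and Drugs",
}

# Concepts judged too general to serve as cluster labels (the truncated top).
TRUNCATED = {"Chemical and Drugs"}


def cluster_by_hypernym(interventions):
    """Group abstracts whose interventions share a common ancestor.

    interventions maps abstract id -> list of intervention concepts; the
    dict's insertion order is taken to be the original search ranking.
    Returns a list of (label, abstract ids), largest cluster first.
    """
    rank = {abstract_id: i for i, abstract_id in enumerate(interventions)}

    # Start with one cluster per intervention concept. An abstract with
    # several interventions lands in several clusters.
    clusters = {}
    for abstract_id, concepts in interventions.items():
        for concept in concepts:
            clusters.setdefault(concept, set()).add(abstract_id)

    merged = True
    while merged:  # repeat until no new clusters can be formed
        merged = False
        by_parent = {}
        for label in clusters:
            parent = PARENT.get(label)
            if parent is not None and parent not in TRUNCATED:
                by_parent.setdefault(parent, []).append(label)
        for parent, children in by_parent.items():
            # Merge when siblings share the parent, or when the parent
            # concept is already a cluster in its own right.
            if len(children) >= 2 or parent in clusters:
                target = clusters.setdefault(parent, set())
                for child in children:
                    target |= clusters.pop(child)
                merged = True

    # Largest clusters first; within a cluster, keep the original ranking.
    ordered = sorted(clusters.items(), key=lambda item: -len(item[1]))
    return [(label, sorted(ids, key=rank.__getitem__)) for label, ids in ordered]


if __name__ == "__main__":
    example = {
        "A1": ["temafloxacin"],
        "A2": ["ofloxacin"],
        "A3": ["ofloxacin", "terazosin"],  # compares two treatments
        "A4": ["alfuzosin"],
    }
    for label, ids in cluster_by_hypernym(example):
        print(label, ids)
```

In this toy run, abstract A3 appears under both drug classes because it mentions two interventions, mirroring the decision to allow an abstract in multiple clusters. Merging stops below "Chemical and Drugs" because that concept is in the truncated set.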

Returning to the example in Figure 1, the ab-