4.2 QDC Operation

The system first asked three questions for each subject X:

In what year was X born?
In what year did X die?
What compositions did X have?

The third of these triggers our named-entity type COMPOSITION that is used for all kinds of titled works – books, films, poems, music, plays and so on, and also quotations. Our named-entity recognizer has rules to detect works of art by phrases that are in apposition to “the film …” or “the book …” etc., and also captures any short phrase in quotes beginning with a capital letter. The particular question phrasing we used does not commit us to any specific creative verb. This is of particular importance since it very frequently happens in text that titled works are associated with their creators by means of a possessive or parenthetical construction, rather than subject-verb-object.

The top five answers, with confidences, are returned for the born and died questions (subject to also passing a confidence threshold test). The compositions question is treated as a list question, meaning that all answers that pass a certain threshold are returned. For each such returned work W_i, two additional questions are asked:

The apparent redundancy here is because of the potential NIL answers for some of the date slots. We also rejected combinations of works whose years spanned more than 100 years (in case there were no BORN or DIED dates). In performing these constraint calculations, NIL satisfied every test by fiat. The constraint network we used is depicted in Figure 2.

[Figure 2 node labels: Birthdate of X; Deathdate of X; Author X; Work W_i; Date of W_i; X_i = Author of W_i]

Figure 2. Constraint Network for evaluation example. Dashed lines represent question-answer pairs, solid lines constraints between the answers.

We used as a test corpus the AQUAINT corpus used in TREC-QA since 2002. Since this was not the same corpus from which the test questions were generated (the Web), we acknowledged that there might be some difference in the most common spelling of certain names, but we made no attempt to correct for this. Neither did we attempt to normalize, translate or aggregate names of the titled works that were returned, so that, for example, “Well-Tempered Klavier” and “Well-Tempered Clavier”
were treated as different. Since only individuals were used in the question set, we did not have instances of problems we saw in training, such as where an ensemble (such as The Beatles) created a certain piece, which in turn via the reciprocal question was found to have been written by a single person (Paul McCartney). The reverse situation was still possible, but we did not handle it. We foresee a future version of our system having knowledge of ensembles and their composition, thus removing this restriction. In general, a variety of ontological relationships could occur between the original individual and the discovered performer(s) of the work.

We generated answer keys by reading the passages that the system had retrieved and from which the answers were generated, to determine “truth”. In cases of absent information in these passages, we did our own corpus searches. This of course made the issue of evaluation of recall only relative, since we were not able to guarantee we had found all existing instances.

We encountered some grey areas, e.g., if a painting appeared in an exhibition or if a celebrity endorsed a product, then should the exhibition’s or product’s name be considered an appropriate “work” of the artist? The general perspective adopted was that we were not establishing or validating the nature of the relationship between an individual and a creative work, but rather its existence. We answered “yes” if we subjectively felt the association to be both very strong and with the individual’s participation – for example, Pamela Anderson and Playboy. However, books/plays about a person or dates of performances of one’s work were considered incorrect. As we shall see, these decisions would not have a big impact on the outcome.

ally associated with the correct artist, so our decision to remove them from consideration resulted in a decrease in both the numerator and denominator of the precision and recall calculations, resulting in a minimal effect.

The results of applying QDC to the 57 test individuals are summarized in Table 3. The baseline assertions for individual X were:

o Top-ranking birthdate/NIL
o Top-ranking deathdate/NIL
o Set of works W_i that passed threshold
o Top-ranking date for W_i/NIL

The sets of baseline assertions (by individual) are in effect the results of QA-by-Dossier WITHOUT Constraints (QbD).

            Assertions                Micro-Average          Macro-Average
            Total   Correct  Truth    Prec   Rec    F        Prec   Rec    F
Baseline    1671    517      933      .309   .554   .396     .331   .520   .386
QDC         1417    813      933      .573   .871   .691     .603   .865   .690

Table 3. Results of Performance Evaluation. Two calculations of P/R/F are made, depending on whether the averaging is done over the whole set, or first by individual; the results are very similar.

The QDC assertions were the same as those for QbD, but reflecting the following effects:

o Some {W_i, date} pairs were thrown out (3 out of 14 on average)
o Some dates in positions 2-6 moved up (applicable to birth, death and work dates)

The results show improvement in both precision and recall, in turn determining a 75-80% relative