4.2  QDC Operation

The system first asked three questions for each potential subject X:

    In what year was X born?
    In what year did X die?
    What compositions did X have?

The third of these triggers our named-entity type COMPOSITION that is used for all kinds of titled works: books, films, poems, music, plays and so on, and also quotations. Our named-entity recognizer has rules to detect works of art by phrases that are in apposition to “the film …” or “the book …” etc., and also captures any short phrase in quotes beginning with a capital letter. The particular question phrasing we used does not commit us to any specific creative verb. This is of particular importance since it very frequently happens in text that titled works are associated with their creators by means of a possessive or parenthetical construction, rather than subject-verb-object.

The top five answers, with confidences, are returned for the born and died questions (subject to also passing a confidence threshold test). The compositions question is treated as a list question, meaning that all answers that pass a certain threshold are returned. For each such returned work Wi, two additional questions are asked:

[…]

The apparent redundancy here is because of the potential NIL answers for some of the date slots. We also rejected combinations of works whose years spanned more than 100 years (in case there were no BORN or DIED dates). In performing these constraint calculations, NIL satisfied every test by fiat. The constraint network we used is depicted in Figure 2.

[Figure 2 node labels: Author X; Birthdate of X; Deathdate of X; Work Wi; Date of Wi; Xi = Author of Wi]

Figure 2. Constraint Network for evaluation example. Dashed lines represent question-answer pairs, solid lines constraints between the answers.

We used as a test corpus the AQUAINT corpus used in TREC-QA since 2002. Since this was not the same corpus from which the test questions were generated (the Web), we acknowledged that there might be some difference in the most common spelling of certain names, but we made no attempt to correct for this. Neither did we attempt to normalize, translate or aggregate names of the titled works that were returned, so that, for example, “Well-Tempered Klavier” and “Well-Tempered Clavier” were treated as different. Since only individuals were used in the question set, we did not have instances of problems we saw in training, such as where an ensemble (such as The Beatles) created a certain piece, which in turn via the reciprocal question was found to have been written by a single person (Paul McCartney). The reverse situation was still possible, but we did not handle it. We foresee a future version of our system having knowledge of ensembles and their composition, thus removing this restriction. In general, a variety of ontological relationships could occur between the original individual and the discovered performer(s) of the work.

We generated answer keys by reading the passages that the system had retrieved and from which the answers were generated, to determine “truth”. In cases of absent information in these passages, we did our own corpus searches. This of course made the issue of evaluation of recall only relative, since we were not able to guarantee we had found all existing instances.

We encountered some grey areas, e.g., if a painting appeared in an exhibition or if a celebrity endorsed a product, then should the exhibition’s or product’s name be considered an appropriate “work” of the artist? The general perspective adopted was that we were not establishing or validating the nature of the relationship between an individual and a creative work, but rather its existence. We answered “yes” if we subjectively felt the association to be both very strong and with the individual’s participation, for example, Pamela Anderson and Playboy. However, books/plays about a person or dates of performances of one’s work were considered incorrect. As we shall see, these decisions would not have a big impact on the outcome.

[…]ally associated with the correct artist, so our decision to remove them from consideration resulted in a decrease in both the numerator and denominator of the precision and recall calculations, resulting in a minimal effect.

The results of applying QDC to the 57 test individuals are summarized in Table 3. The baseline assertions for individual X were:
o Top-ranking birthdate/NIL
o Top-ranking deathdate/NIL
o Set of works Wi that passed threshold
o Top-ranking date for Wi /NIL

The sets of baseline assertions (by individual) are in effect the results of QA-by-Dossier WITHOUT Constraints (QbD).

             Assertions               Micro-Average         Macro-Average
             Total  Correct  Truth    Prec   Rec    F       Prec   Rec    F
Baseline     1671   517      933      .309   .554   .396    .331   .520   .386
QDC          1417   813      933      .573   .871   .691    .603   .865   .690

Table 3. Results of Performance Evaluation. Two calculations of P/R/F are made, depending on whether the averaging is done over the whole set, or first by individual; the results are very similar.

The QDC assertions were the same as those for QbD, but reflecting the following effects:
o Some {Wi, date} pairs were thrown out (3 out of 14 on average)
o Some dates in positions 2-6 moved up (applicable to birth, death and work dates)

The results show improvement in both precision and recall, in turn determining a 75-80% relative