"NICOLE KIDMAN"• CONSTITUENTS MARKED AS FOCUS ARE GENERALLY EX...

35.0: "Nicole Kidman"

The way we find supporting paragraphs for these answers is probably best explained by giving an example. Figure 3 shows the Lucene query we use for the mentioned question and answer candidates. (The numbers behind the terms indicate query weights.) As can be seen, we initially build two separate queries for the Headers and the Text fields (compare Table 1). In a later processing step, both queries are combined into a single query using Lucene’s MultipleFieldQueryCreator class. Note also that both answer candidates (“Katie Holmes” and “Nicole Kidman”) are included in this one query. This is done for speed reasons: in our setup, each query takes roughly two seconds of processing time, whereas the complexity and length of a query have very little impact on speed.
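A minimal sketch of this query construction, written against the Lucene 2.x-era API that matches the system’s vintage. Since MultipleFieldQueryCreator is not part of the standard Lucene distribution, the final combination step is approximated here with a plain BooleanQuery over both field queries; the class name, the literal field names ("headers", "text"), and the lowercasing are assumptions for illustration.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Sketch: one boosted query per field, merged into a single query.
public class SupportQueryBuilder {

    // Adds a name once as a quoted phrase and once word by word,
    // each with its own boost (cf. the named-entity handling below).
    static void addName(BooleanQuery q, String field, String name,
                        float phraseBoost, float wordBoost) {
        PhraseQuery phrase = new PhraseQuery();
        for (String word : name.split(" ")) {
            phrase.add(new Term(field, word.toLowerCase()));
        }
        phrase.setBoost(phraseBoost);
        q.add(phrase, Occur.SHOULD);
        for (String word : name.split(" ")) {
            TermQuery tq = new TermQuery(new Term(field, word.toLowerCase()));
            tq.setBoost(wordBoost);
            q.add(tq, Occur.SHOULD);
        }
    }

    // Rebuilds the example of Figure 3 for "Who is Tom Cruise married to?".
    public static Query buildExample() {
        BooleanQuery header = new BooleanQuery();   // Headers field query
        addName(header, "headers", "Tom Cruise", 10f, 5f);
        addName(header, "headers", "Katie Holmes", 5f, 2.5f);
        addName(header, "headers", "Nicole Kidman", 4.3f, 2.2f);

        BooleanQuery text = new BooleanQuery();     // Text field query
        TermQuery married = new TermQuery(new Term("text", "married"));
        married.setBoost(10f);
        text.add(married, Occur.SHOULD);
        addName(text, "text", "Tom Cruise", 1.5f, 4.5f);
        addName(text, "text", "Katie Holmes", 3f, 9f);
        addName(text, "text", "Nicole Kidman", 2.2f, 6.6f);

        // Both field queries become one disjunction, so one (slow) search
        // call covers both fields and both answer candidates at once.
        BooleanQuery combined = new BooleanQuery();
        combined.add(header, Occur.SHOULD);
        combined.add(text, Occur.SHOULD);
        return combined;
    }
}

Merging everything into one disjunctive query matches the speed argument above: one search call per question rather than one per candidate and field.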

The type of question influences the query building process in a fundamental manner. For the question “When was Franz Kafka born?” and the correct answer “July 3, 1883”, for example, it is reasonable to search for an article with the title “Franz Kafka” and to expect the answer in the text on that page. For the question “Who invented the automobile?”, on the other hand, it is more reasonable to search for the information on a page called “Karl Benz” (the answer to the question). In order to capture this behaviour, we developed a set of rules that, for different types of questions, increase or decrease constituents’ weights in either the Headers or the Text field.

Constituents marked as Focus are generally expected to be found in the text, especially if they are verbs². The focus indicates what the question asks for, and such information is usually more likely to be found in the text than in titles or subtitles.
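The rule set itself is not given in the paper; the following toy sketch only illustrates the kind of per-question-type weight shifting described above. The question types, multipliers, and the Focus handling are invented for the example.

// Toy sketch of per-question-type field weighting. Question types and
// multipliers are invented for illustration, not the actual rule set.
enum QuestionType { BIRTH_DATE, INVENTOR }

class FieldWeights {
    float header, text;
    FieldWeights(float header, float text) { this.header = header; this.text = text; }
}

class QuestionTypeRules {
    // Adjusts a constituent's weights depending on the question type, on
    // whether it comes from the question or an answer candidate, and on
    // whether it is the Focus (expected in the text rather than in titles).
    static FieldWeights adjust(QuestionType type, boolean fromQuestion,
                               boolean isFocus, FieldWeights w) {
        switch (type) {
            case BIRTH_DATE:
                // "When was Franz Kafka born?": question terms make a good
                // title, the answer (a date) is expected in the article text.
                if (fromQuestion) { w.header *= 2.0f; } else { w.text *= 2.0f; }
                break;
            case INVENTOR:
                // "Who invented the automobile?": the answer ("Karl Benz")
                // is the likely article title, so boost it in Headers.
                if (fromQuestion) { w.text *= 2.0f; } else { w.header *= 2.0f; }
                break;
        }
        if (isFocus) { w.text *= 1.5f; w.header *= 0.5f; }
        return w;
    }
}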

Figure 3 also shows that, if we recognize named entities (especially person names) in the question or answer strings, we include each named entity once as a quoted string and additionally add the words it contains separately. This boosts documents which contain the complete name as used in the question or the answer, while still allowing documents which contain variants of these names, e.g. “Thomas Cruise Mapother IV”.

The formula to determine the exact boost factor for each query term is complex and a matter of ongoing development. It additionally depends on the following criteria (a toy combination is sketched after the list):

• Named entities receive a higher weight.

• Capitalized words or constituents receive a higher weight.

• The confidence value associated with the answer candidate influences the boost factor.

• Whether a term originates from the question or from an answer candidate influences its weight in a different manner for the Headers and Text fields.
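As announced above, here is a toy combination of these criteria into a single boost factor. All constants and the confidence scaling are placeholders, not the actual (unpublished) formula.

// Toy combination of the listed criteria into one boost factor.
// All constants and the confidence scaling are placeholders.
class BoostFactor {
    static float boost(boolean namedEntity, boolean capitalized,
                       boolean fromQuestion, boolean headerField,
                       float answerConfidence /* 0 for question terms */) {
        float b = 1.0f;
        if (namedEntity) b *= 2.0f;   // named entities weigh more
        if (capitalized) b *= 1.5f;   // capitalized words/constituents weigh more
        if (!fromQuestion) {
            b *= 1.0f + answerConfidence;   // scale with candidate confidence
        }
        // Origin matters per field: in Figure 3, question terms dominate the
        // header query while answer terms dominate the text query.
        if (fromQuestion == headerField) b *= 2.0f;
        return b;
    }
}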

² With allowing verbs to be the Focus, we slightly depart from the traditional definition of the term.

Header query: "Tom Cruise"^10 Tom^5 Cruise^5 "Katie Holmes"^5 Katie^2.5 Holmes^2.5 "Nicole Kidman"^4.3 Nicole^2.2 Kidman^2.2

Text query: married^10 "Tom Cruise"^1.5 Tom^4.5 Cruise^4.5 "Katie Holmes"^3 Katie^9 Holmes^9 "Nicole Kidman"^2.2 Nicole^6.6 Kidman^6.6

Figure 3: Lucene queries used to find supporting documents for the question “Who is Tom Cruise married to?” and the two answers “Katie Holmes” and “Nicole Kidman”. Both queries are combined using Lucene’s MultipleFieldQueryCreator class.
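For completeness, a hedged usage sketch of running such a combined query against the index (the index path is a placeholder, the field name an assumption, and Hits is the Lucene 2.x-era result type):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class SupportSearch {
    public static void main(String[] args) throws Exception {
        // Index location is a placeholder; the paper does not specify it.
        IndexSearcher searcher = new IndexSearcher("/path/to/wikipedia-index");
        Hits hits = searcher.search(SupportQueryBuilder.buildExample());
        for (int i = 0; i < Math.min(10, hits.length()); i++) {
            System.out.println(hits.score(i) + "\t" + hits.doc(i).get("headers"));
        }
        searcher.close();
    }
}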

4 Future Work

Although QuALiM performed well in recent TREC evaluations, improving precision and recall will of course always be on our agenda. Besides this, we currently focus on increasing processing speed. At the time of writing, the web demo runs on a server with a single 3GHz Intel Pentium D dual core processor

References

Erik Hatcher and Otis Gospodnetić. 2004. Lucene in Action. Manning Publications Co.

Michael Kaisser and Tilman Becker. 2004. Question Answering by Searching Large Corpora with Linguistic Methods. In The Proceedings of the 2004 Edition of the Text REtrieval Conference, TREC 2004.

Michael Kaisser, Silke Scheible, and Bonnie Webber.