"NICOLE KIDMAN"• CONSTITUENTS MARKED AS FOCUS ARE GENERALLY EX...
35.0: "Nicole Kidman"
The way we find supporting paragraphs for these answers is probably best explained by giving an example. Figure 3 shows the Lucene query we use for the mentioned question and answer candidates. (The numbers behind the terms indicate query weights.) As can be seen, we initially build two separate queries for the Headers and the Text fields (compare Table 1). In a later processing step, both queries are combined into a single query using Lucene’s MultipleFieldQueryCreator class. Note also that both answer candidates (“Katie Holmes” and “Nicole Kidman”) are included in this one query. This is done for reasons of speed: in our setup, each query takes up roughly two seconds of processing time. The complexity and length of a query, on the other hand, have very little impact on speed.

The type of question influences the query building process in a fundamental manner. For the question “When was Franz Kafka born?” and the correct answer “July 3, 1883”, for example, it is reasonable to search for an article with the title “Franz Kafka” and to expect the answer in the text on that page. For the question “Who invented the automobile?”, on the other hand, it is more reasonable to search for the information on a page called “Karl Benz” (the answer to the question). In order to capture this behaviour, we developed a set of rules that, for different types of questions, increase or decrease constituents’ weights in either the Headers or the Text fields.

• Constituents marked as Focus are generally expected to be found in the text, especially if they are verbs. The focus indicates what the question asks for, and such information can usually be expected in the text rather than in titles or subtitles.

Figure 3 also shows that, if we recognize named entities (especially person names) in the question or answer strings, we include each named entity once as a quoted string and additionally add the words it contains separately. This is to boost documents which contain the complete name as used in the question or the answer, but also to allow documents which contain variants of these names, e.g. “Thomas Cruise Mapother IV”.

The formula to determine the exact boost factor for each query term is complex and a matter of ongoing development. It additionally depends on the following criteria:

• Named entities receive a higher weight.
• Capitalized words or constituents receive a higher weight.
• The confidence value associated with the answer candidate influences the boost factor.
• Whether a term originates from the question or an answer candidate influences its weight in a different manner for the header and text fields.
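The query construction described in this section can be illustrated with a minimal sketch in Python. It only assembles query strings in Lucene's classic query syntax (`field:(...)`, quoted phrases, `^boost`). It is not our actual implementation: the `Term` class, the `boost` and `field_query` functions, every numeric factor, and the assumption that the question mentions “Tom Cruise” are hypothetical and chosen purely for illustration, since the real boost formula is more complex and still under development.

```python
from dataclasses import dataclass


@dataclass
class Term:
    """One query term (hypothetical helper type, not part of Lucene)."""
    text: str                      # a single word or a multi-word named entity
    is_named_entity: bool = False
    from_answer: bool = False      # True if the term stems from an answer candidate
    confidence: float = 1.0        # answer-candidate confidence, assumed in [0, 1]


def boost(term: Term, field: str) -> float:
    """Illustrative boost only; all factors below are made-up assumptions."""
    weight = 1.0
    if term.is_named_entity:
        weight *= 2.0              # named entities receive a higher weight
    if term.text[:1].isupper():
        weight *= 1.5              # capitalized terms receive a higher weight
    if term.from_answer:
        weight *= term.confidence  # candidate confidence scales the boost
        if field == "Text":
            weight *= 1.5          # assumed: answers are favoured in the text
    elif field == "Headers":
        weight *= 1.5              # assumed: question terms are favoured in headers
    return round(weight, 2)


def field_query(field: str, terms: list[Term]) -> str:
    """Build one Lucene query string for a single field."""
    clauses = []
    for term in terms:
        b = boost(term, field)
        words = term.text.split()
        if term.is_named_entity and len(words) > 1:
            # Include the complete name once as a quoted phrase ...
            clauses.append(f'"{term.text}"^{b}')
            # ... and each contained word separately, to admit name variants.
            clauses.extend(f"{w}^{round(b / 2, 2)}" for w in words)
        else:
            clauses.append(f"{term.text}^{b}")
    return f"{field}:({' '.join(clauses)})"


terms = [
    Term("Tom Cruise", is_named_entity=True),
    Term("Katie Holmes", is_named_entity=True, from_answer=True, confidence=0.8),
    Term("Nicole Kidman", is_named_entity=True, from_answer=True, confidence=0.6),
]

# Two separate per-field queries, then one combined query for both candidates.
headers_query = field_query("Headers", terms)
text_query = field_query("Text", terms)
combined_query = f"{headers_query} {text_query}"
print(combined_query)
```

Putting both answer candidates into one combined string mirrors the speed argument above: the cost is dominated by the number of queries issued, not by their length.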
2
With allowing verbs to be the Focus, we slightly depart