"NICOLE KIDMAN"• CONSTITUENTS MARKED AS FOCUS ARE GENERALLY EX...

35.0: "Nicole Kidman"

The way we find supporting paragraphs for these answers is probably best explained by giving an example. Figure 3 shows the Lucene query we use for the mentioned question and answer candidates. (The numbers behind the terms indicate query weights.) As can be seen, we initially build two separate queries for the Headers and the Text fields (compare Table 1). In a later processing step, both queries are combined into a single query using Lucene’s MultipleFieldQueryCreator class. Note also that both answer candidates (“Katie Holmes” and “Nicole Kidman”) are included in this one query. This is done for speed reasons: in our setup, each query takes roughly two seconds of processing time, whereas the complexity and length of a query have very little impact on speed.
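A minimal sketch of this query construction, written against the Lucene 2.x-era API that matches the system’s vintage. Since MultipleFieldQueryCreator is not part of the standard Lucene distribution, the final combination step is approximated here with a plain BooleanQuery over both field queries; the class name, the literal field names ("headers", "text"), and the lowercasing are assumptions for illustration.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Sketch: one boosted query per field, merged into a single query.
public class SupportQueryBuilder {

    // Adds a name once as a quoted phrase and once word by word,
    // each with its own boost (cf. the named-entity handling below).
    static void addName(BooleanQuery q, String field, String name,
                        float phraseBoost, float wordBoost) {
        PhraseQuery phrase = new PhraseQuery();
        for (String word : name.split(" ")) {
            phrase.add(new Term(field, word.toLowerCase()));
        }
        phrase.setBoost(phraseBoost);
        q.add(phrase, Occur.SHOULD);
        for (String word : name.split(" ")) {
            TermQuery tq = new TermQuery(new Term(field, word.toLowerCase()));
            tq.setBoost(wordBoost);
            q.add(tq, Occur.SHOULD);
        }
    }

    // Rebuilds the example of Figure 3 for "Who is Tom Cruise married to?".
    public static Query buildExample() {
        BooleanQuery header = new BooleanQuery();   // Headers field query
        addName(header, "headers", "Tom Cruise", 10f, 5f);
        addName(header, "headers", "Katie Holmes", 5f, 2.5f);
        addName(header, "headers", "Nicole Kidman", 4.3f, 2.2f);

        BooleanQuery text = new BooleanQuery();     // Text field query
        TermQuery married = new TermQuery(new Term("text", "married"));
        married.setBoost(10f);
        text.add(married, Occur.SHOULD);
        addName(text, "text", "Tom Cruise", 1.5f, 4.5f);
        addName(text, "text", "Katie Holmes", 3f, 9f);
        addName(text, "text", "Nicole Kidman", 2.2f, 6.6f);

        // Both field queries become one disjunction, so one (slow) search
        // call covers both fields and both answer candidates at once.
        BooleanQuery combined = new BooleanQuery();
        combined.add(header, Occur.SHOULD);
        combined.add(text, Occur.SHOULD);
        return combined;
    }
}

Merging everything into one disjunctive query matches the speed argument above: one search call per question rather than one per candidate and field.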

The type of question influences the query building process in a fundamental manner. For the question “When was Franz Kafka born?” and the correct answer “July 3, 1883”, for example, it is reasonable to search for an article with the title “Franz Kafka” and to expect the answer in the text on that page. For the question “Who invented the automobile?”, on the other hand, it is more reasonable to search for the information on a page called “Karl Benz” (the answer to the question). In order to capture this behaviour, we developed a set of rules that, for different types of questions, increase or decrease constituents’ weights in either the Headers or the Text field.

Constituents marked as Focus are generally expected to be found in the text, especially if they are verbs². The focus indicates what the question asks for, and such information is usually more likely to be found in the text than in titles or subtitles.
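The rule set itself is not given in the paper; the following toy sketch only illustrates the kind of per-question-type weight shifting described above. The question types, multipliers, and the Focus handling are invented for the example.

// Toy sketch of per-question-type field weighting. Question types and
// multipliers are invented for illustration, not the actual rule set.
enum QuestionType { BIRTH_DATE, INVENTOR }

class FieldWeights {
    float header, text;
    FieldWeights(float header, float text) { this.header = header; this.text = text; }
}

class QuestionTypeRules {
    // Adjusts a constituent's weights depending on the question type, on
    // whether it comes from the question or an answer candidate, and on
    // whether it is the Focus (expected in the text rather than in titles).
    static FieldWeights adjust(QuestionType type, boolean fromQuestion,
                               boolean isFocus, FieldWeights w) {
        switch (type) {
            case BIRTH_DATE:
                // "When was Franz Kafka born?": question terms make a good
                // title, the answer (a date) is expected in the article text.
                if (fromQuestion) { w.header *= 2.0f; } else { w.text *= 2.0f; }
                break;
            case INVENTOR:
                // "Who invented the automobile?": the answer ("Karl Benz")
                // is the likely article title, so boost it in Headers.
                if (fromQuestion) { w.text *= 2.0f; } else { w.header *= 2.0f; }
                break;
        }
        if (isFocus) { w.text *= 1.5f; w.header *= 0.5f; }
        return w;
    }
}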

Figure 3 also shows that, if we recognize named entities (especially person names) in the question or answer strings, we include each named entity once as a quoted string and additionally add the words it contains separately. This boosts documents which contain the complete name as used in the question or the answer, while still allowing documents which contain variants of these names, e.g. “Thomas Cruise Mapother IV”.

The formula to determine the exact boost factor for each query term is complex and a matter of ongoing development. It additionally depends on the following criteria (a toy combination is sketched after the list):

• Named entities receive a higher weight.

• Capitalized words or constituents receive a higher weight.

• The confidence value associated with the answer candidate influences the boost factor.

• Whether a term originates from the question or from an answer candidate influences its weight in a different manner for the Headers and Text fields.
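As announced above, here is a toy combination of these criteria into a single boost factor. All constants and the confidence scaling are placeholders, not the actual (unpublished) formula.

// Toy combination of the listed criteria into one boost factor.
// All constants and the confidence scaling are placeholders.
class BoostFactor {
    static float boost(boolean namedEntity, boolean capitalized,
                       boolean fromQuestion, boolean headerField,
                       float answerConfidence /* 0 for question terms */) {
        float b = 1.0f;
        if (namedEntity) b *= 2.0f;   // named entities weigh more
        if (capitalized) b *= 1.5f;   // capitalized words/constituents weigh more
        if (!fromQuestion) {
            b *= 1.0f + answerConfidence;   // scale with candidate confidence
        }
        // Origin matters per field: in Figure 3, question terms dominate the
        // header query while answer terms dominate the text query.
        if (fromQuestion == headerField) b *= 2.0f;
        return b;
    }
}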

² With allowing verbs to be the Focus, we slightly depart from the traditional definition of the term.

Header query: "Tom Cruise"^10 Tom^5 Cruise^5 "Katie Holmes"^5 Katie^2.5 Holmes^2.5 "Nicole Kidman"^4.3 Nicole^2.2 Kidman^2.2

Text query: married^10 "Tom Cruise"^1.5 Tom^4.5 Cruise^4.5 "Katie Holmes"^3 Katie^9 Holmes^9 "Nicole Kidman"^2.2 Nicole^6.6 Kidman^6.6

Figure 3: Lucene queries used to find supporting documents for the question “Who is Tom Cruise married to?” and the two answers “Katie Holmes” and “Nicole Kidman”. Both queries are combined using Lucene’s MultipleFieldQueryCreator class.
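For completeness, a hedged usage sketch of running such a combined query against the index (the index path is a placeholder, the field name an assumption, and Hits is the Lucene 2.x-era result type):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class SupportSearch {
    public static void main(String[] args) throws Exception {
        // Index location is a placeholder; the paper does not specify it.
        IndexSearcher searcher = new IndexSearcher("/path/to/wikipedia-index");
        Hits hits = searcher.search(SupportQueryBuilder.buildExample());
        for (int i = 0; i < Math.min(10, hits.length()); i++) {
            System.out.println(hits.score(i) + "\t" + hits.doc(i).get("headers"));
        }
        searcher.close();
    }
}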

4 Future Work

Although QuALiM performed well in recent TREC evaluations, improving precision and recall will of course always be on our agenda. Besides this, we currently focus on increasing processing speed. At the time of writing, the web demo runs on a server with a single 3GHz Intel Pentium D dual core processor

References

Erik Hatcher and Otis Gospodnetić. 2004. Lucene in Action. Manning Publications Co.

Michael Kaisser and Tilman Becker. 2004. Question Answering by Searching Large Corpora with Linguistic Methods. In The Proceedings of the 2004 Edition of the Text REtrieval Conference, TREC 2004.

Michael Kaisser, Silke Scheible, and Bonnie Webber.