
significantly.

1 Introduction

It is a well-known problem in Information Retrieval (IR) and Question Answering (QA) that queries and relevant textual content often significantly differ in their properties, and are therefore difficult to match with traditional IR methods. A common example is a user entering words to describe their information need that do not match the words used in the most relevant indexed documents. This work addresses this problem, but shifts focus from words to the syntactic structures of questions and relevant pieces of text. To this end, we present a novel algorithm that analyses the dependency structures of questions and of relevant pieces of text. The methods described in this paper, however, can also be applied to other IR scenarios, e.g. web search. The necessary condition for our approach to work is that the user query is somewhat grammatically well formed; queries of this kind are commonly referred to as Natural Language Queries, or NLQs.

Table 1 provides evidence that users indeed search the web with NLQs. The data is based on two query sets sampled from three months of user logs of a popular search engine, using two different sampling techniques. The “head” set samples queries taking query frequency into account, so that more common queries have a proportionally higher chance of being selected. The “tail” set samples only queries that were issued fewer than 500 times during the three-month period, disregarding query frequency; as a result, rare and frequent queries have the same chance of being selected. Duplicate queries are excluded from both sets.
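The paper does not spell out the two sampling procedures beyond this description, but they can be sketched as follows. This is only an illustration of the stated scheme: the log representation (a map from query string to issue count) and both function names are our own assumptions, not the authors' code.

import random

def sample_head(log, k, seed=0):
    """Frequency-weighted ("head") sampling: queries are drawn in
    proportion to how often they were issued, so common queries have
    a proportionally higher chance of being selected. `log` maps a
    query string to its (positive) issue count."""
    rng = random.Random(seed)
    queries = list(log)
    weights = [log[q] for q in queries]
    picked = set()  # a set, so duplicate queries are excluded
    while len(picked) < min(k, len(queries)):
        picked.add(rng.choices(queries, weights=weights)[0])
    return picked

def sample_tail(log, k, max_freq=500, seed=0):
    """Uniform ("tail") sampling over queries issued fewer than
    `max_freq` times: below the cap, rare and frequent queries have
    the same chance of being selected."""
    rng = random.Random(seed)
    eligible = [q for q, n in log.items() if n < max_freq]
    return set(rng.sample(eligible, min(k, len(eligible))))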


Table 1 lists the percentage of queries in each set that start with the specified word. In most contexts such a word indicates that the query is a question, which in turn means that we are dealing with an NLQ. Of course there are many NLQs that start with words other than the ones listed, so we can expect their real percentage to be even higher.

    Set        Head      Tail
    Query #    15,665    12,500
    how         1.33%     2.42%
    what        0.77%     1.89%
    define      0.34%     0.18%
    is/are      0.25%     0.42%
    where       0.18%     0.45%
    do/does     0.14%     0.30%
    can         0.14%     0.25%
    why         0.13%     0.30%
    who         0.12%     0.38%
    when        0.09%     0.21%
    which       0.03%     0.08%
    Total       3.55%     6.86%

Table 1: Percentages of Natural Language Queries in head and tail search engine query logs. See text for details.
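For concreteness, percentages like those in Table 1 can be computed by checking the first token of each sampled query, as in the minimal sketch below. The prefix list mirrors the rows of the table and the helper name is our own; note that the sketch counts is/are and do/does separately, whereas Table 1 merges each pair into one row.

NLQ_PREFIXES = ("how", "what", "define", "is", "are", "where",
                "do", "does", "can", "why", "who", "when", "which")

def nlq_percentages(queries):
    """Percentage of queries starting with each NLQ-indicating word,
    plus their total, in the style of Table 1."""
    counts = dict.fromkeys(NLQ_PREFIXES, 0)
    for q in queries:
        tokens = q.lower().split()
        if tokens and tokens[0] in counts:
            counts[tokens[0]] += 1
    pct = {w: 100.0 * n / len(queries) for w, n in counts.items()}
    pct["Total"] = sum(pct.values())
    return pct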

2 Related Work

Our approach addresses the same fundamental problem, but shifts focus from query term/document term mismatch to mismatches observed between the grammatical structure of Natural Language Queries and relevant text pieces. In order to achieve this, we analyze the syntactic structure of the queries and of the relevant content using dependency paths.

Especially in QA there is a strong tradition of using dependency structures: Lin and Pantel (2001) present an unsupervised algorithm that automatically discovers inference rules (essentially paraphrases) from text. These inference rules are based on dependency paths, each of which connects two nouns. Their paths have the following form:

N:subj:V←find→V:obj:N→solution→N:to:N

This path represents the relation “X finds a solution to Y” and can be mapped to another path representing, e.g., “X solves Y.” As such, the approach is suited to detecting paraphrases that describe the relation between two entities in documents. However, the paper does not describe how the mined paraphrases can be linked to questions, or which paraphrase is suitable for answering which question type.
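As an illustration of how such path-based inference rules can be represented and applied, consider the following sketch. The ASCII string encoding of paths and all names are ours, simplified from the notation above; they are not Lin and Pantel's actual data structures.

# Paths are written as plain strings in a simplified ASCII version of
# the notation above; X fills the left slot, Y the right slot.
FIND_SOLUTION = "N:subj:V<-find->V:obj:N->solution->N:to:N"  # "X finds a solution to Y"
SOLVE = "N:subj:V<-solve->V:obj:N"                           # "X solves Y"

# An inference rule maps a path to a paraphrase of it.
RULES = {FIND_SOLUTION: SOLVE}

def infer(x, path, y):
    """Given an instance (x, path, y) extracted from text, return the
    corresponding instance of the paraphrased path, if a rule applies."""
    paraphrase = RULES.get(path)
    return (x, paraphrase, y) if paraphrase else None

# "The committee finds a solution to the crisis" implies
# "The committee solves the crisis":
print(infer("committee", FIND_SOLUTION, "crisis"))

Nothing in such a rule set indicates which rules are triggered by which question types, which is exactly the gap noted above.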

Attardi et al. (2001) describe a QA system that, after a set of candidate answer sentences has been identified, matches their dependency relations against those of the question. Questions and answer sentences are parsed with MiniPar (Lin,