significantly.

1 Introduction

It is a well-known problem in Information Retrieval (IR) and Question Answering (QA) that queries and relevant textual content often significantly differ in their properties, and are therefore difficult to match with traditional IR methods. A common example is a user entering words to describe their information need that do not match the words used in the most relevant indexed documents. This work addresses this problem, but shifts focus from words to the syntactic structures of questions and relevant pieces of text. To this end, we present a novel algorithm that analyses the dependency [...]

[...] 1999). The methods described in this paper, however, can also be applied to other IR scenarios, e.g. web search. The necessary condition for our approach to work is that the user query is somewhat grammatically well formed; queries of this kind are commonly referred to as Natural Language Queries, or NLQs.

Table 1 provides evidence that users indeed search the web with NLQs. The data is based on two query sets sampled from three months of user logs of a popular search engine, using two different sampling techniques. The “head” set samples queries taking query frequency into account, so that more common queries have a proportionally higher chance of being selected. The “tail” query set samples only queries that have been issued less than 500 times during the three-month period, and it disregards query frequency. As a result, rare and frequent queries have the same chance of being selected. Duplicates are excluded from both sets. Table 1 lists the percentage of queries in the query sets that start with the specified word. In most contexts this indicates that the query is a question, which in turn means that we are dealing with an NLQ. Of course there are many NLQs that start with words other than the ones listed, so we can expect their real percentage to be even higher.

Set        Head     Tail
Query #    15,665   12,500
how        1.33%    2.42%
what       0.77%    1.89%
define     0.34%    0.18%
is/are     0.25%    0.42%
where      0.18%    0.45%
do/does    0.14%    0.30%
can        0.14%    0.25%
why        0.13%    0.30%
who        0.12%    0.38%
when       0.09%    0.21%
which      0.03%    0.08%
Total      3.55%    6.86%

Table 1: Percentages of Natural Language Queries in head and tail search engine query logs. See text for details.

2 Related Work

[...]fundamental problem, but shifting focus from query term/document term mismatch to mismatches observed between the grammatical structure of Natural Language Queries and relevant text pieces. In order to achieve this we analyze the queries’ and the relevant contents’ syntactic structure by using dependency paths.

Especially in QA there is a strong tradition of using dependency structures: (Lin and Pantel, 2001) present an unsupervised algorithm to automatically discover inference rules (essentially paraphrases) from text. These inference rules are based on dependency paths, each of which connects two nouns. Their paths have the following form:

N:subj:V←find→V:obj:N→solution→N:to:N

This path represents the relation “X finds a solution to Y” and can be mapped to another path representing e.g. “X solves Y.” As such, the approach is suitable to detect paraphrases that describe the relation between two entities in documents. However, the paper does not describe how the mined paraphrases can be linked to questions, and which paraphrase is suitable to answer which question type.

(Attardi et al., 2001) describes a QA system that, after a set of candidate answer sentences has been identified, matches their dependency relations against the question. Questions and answer sentences are parsed with MiniPar (Lin,
1999)
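The Lin and Pantel inference rules described above pair dependency paths whose slot fillers overlap in a corpus. The following is a minimal sketch of that idea, not code from either paper: the tuple encoding of paths, the triple-list corpus, and the Jaccard overlap (a crude stand-in for the mutual-information-based similarity Lin and Pantel actually use) are all simplifying assumptions for illustration.

```python
# Sketch (illustrative, not from the paper): encode Lin-and-Pantel-style
# dependency paths and compare them by the noun pairs they connect.

# A path is a tuple of slot/edge labels and lemmas between two slot
# nouns X and Y, e.g. "X finds a solution to Y" and "X solves Y":
FIND_SOLUTION = ("N:subj:V", "find", "V:obj:N", "solution", "N:to:N")
SOLVE = ("N:subj:V", "solve", "V:obj:N")

def fillers(corpus, path):
    """Set of (X, Y) noun pairs the path connects in the corpus.

    `corpus` is assumed to be a list of (path, x, y) triples already
    extracted by a dependency parser; extraction is out of scope here.
    """
    return {(x, y) for p, x, y in corpus if p == path}

def similarity(corpus, path_a, path_b):
    """Jaccard overlap of slot fillers; Lin and Pantel (2001) use a
    mutual-information measure instead, but the intuition is the same:
    paths sharing many fillers are paraphrase candidates."""
    a, b = fillers(corpus, path_a), fillers(corpus, path_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy corpus in which both paths connect the same entity pairs, so
# "X finds a solution to Y" and "X solves Y" come out as paraphrases.
corpus = [
    (FIND_SOLUTION, "committee", "crisis"),
    (SOLVE, "committee", "crisis"),
    (FIND_SOLUTION, "engineer", "problem"),
    (SOLVE, "engineer", "problem"),
]

print(similarity(corpus, FIND_SOLUTION, SOLVE))  # → 1.0 on this toy corpus
```

With real parser output the overlap would be partial rather than exact, and a ranked list of high-similarity path pairs would constitute the mined inference rules.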