Monolingual translation probabilities have recently been introduced in retrieval models to solve the lexical gap problem. They can be obtained by training statistical translation models on parallel monolingual corpora, such as question-answer pairs, where answers act as the “source” language and questions as the “target” language. In this paper, we propose to use as a parallel training dataset the definitions and glosses provided for the same term by different lexical semantic resources. We compare monolingual translation models built from lexical semantic resources with two other kinds of datasets: manually-tagged question reformulations and question-answer pairs. We also show that the monolingual translation probabilities obtained (i) are comparable to traditional semantic relatedness measures and (ii) significantly improve the results over the query likelihood and the vector-space model for answer finding.

1 Introduction

The lexical gap (or lexical chasm) often observed between queries and documents or questions and answers is a pervasive problem both in Information Retrieval (IR) and Question Answering (QA). This problem arises from alternative ways of conveying the same information, due to synonymy or paraphrasing, and is especially severe for retrieval over shorter documents, such as sentence retrieval or question retrieval in Question & Answer archives. Several solutions to this problem have been proposed, including query expansion (Riezler et al., 2007; Fang, 2008), query reformulation or paraphrasing (Hermjakob et al., 2002; Tomuro, 2003; Zukerman and Raskutti, 2002) [...] 2007).

Berger and Lafferty (1999) have formulated a further solution to the lexical gap problem, which consists of integrating monolingual statistical translation models into the retrieval process. Monolingual translation models encode statistical word associations which are trained on parallel monolingual corpora. The major drawback of this approach lies in the limited availability of truly parallel monolingual corpora. In practice, training data for translation-based retrieval often consists of question-answer pairs, usually extracted from the evaluation corpus itself (Riezler et al., 2007; Xue et al., 2008; Lee et al., 2008). While collection-specific translation models effectively encode statistical word associations for the target document collection, this also introduces a bias in the evaluation and makes it difficult to assess the quality of the translation model per se, independently from a specific task and document collection.

In this paper, we propose new kinds of datasets for training domain-independent monolingual translation models. We use the definitions and glosses provided for the same term by different lexical semantic resources to automatically train the translation models. This approach has only very recently been made possible by the emergence of new kinds of lexical semantic and encyclopedic resources such as Wikipedia and Wiktionary. These resources are freely available and up-to-date, and they offer broad coverage and good quality. Thanks to the combination of several resources, it is possible to obtain monolingual parallel corpora which are large enough to train domain-independent translation models. In addition, we collected question-answer pairs and manually-tagged question reformulations from a social Q&A site. We use these datasets to build further translation models.
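To make the idea of training on parallel glosses concrete, the following is a minimal sketch, assuming an IBM Model 1 style EM procedure. The text above does not state which alignment model or toolkit is used, so this choice is an assumption; the function name `train_ibm_model1` and the toy gloss pairs are hypothetical.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate P(target_word | source_word) with EM (IBM Model 1 style).

    `pairs` is a list of (source_tokens, target_tokens) tuples, e.g. two
    glosses of the same term taken from different resources (assumption).
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    if not target_vocab:
        return {}
    # Uniform initialisation over the target vocabulary.
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)  # keyed by (target, source)

    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            for t in tgt:
                # E-step: split one unit of mass for t among the source words.
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    frac = prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: renormalise so that P(. | s) sums to 1 for each source word.
        prob = defaultdict(
            float, {(t, s): c / total[s] for (t, s), c in count.items()}
        )
    return dict(prob)


# Hypothetical toy "parallel" glosses, not data from the paper.
pairs = [
    (["big", "dog"], ["large", "hound"]),
    (["big", "cat"], ["large", "feline"]),
    (["small", "cat"], ["tiny", "feline"]),
]
table = train_ibm_model1(pairs)
print(sorted((t, round(p, 3)) for (t, s), p in table.items() if s == "dog"))
```

On this toy input, "dog" comes to prefer "hound" over "large", because "large" is better explained by "big", which co-occurs with it in two pairs.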
Translation-based retrieval models have been widely used in practice by the IR and QA community. However, the quality of the semantic information encoded in the translation tables has never been assessed intrinsically. To do so, we compare translation probabilities with concept-vector-based semantic relatedness measures with respect to human relatedness rankings for reference word pairs. This study provides empirical evidence for the high quality of the semantic information encoded in statistical word translation tables. We then use the translation models in an answer finding task based on a new question-answer dataset which is entirely independent of the resources used for training the translation models. This extrinsic evaluation shows that our translation models significantly improve the results over the query likelihood and the vector-space model.

The remainder of the paper is organised as follows. Section 2 discusses related work on semantic relatedness and statistical translation models for retrieval. Section 3 presents the monolingual parallel datasets we used for obtaining monolingual translation probabilities. Semantic relatedness experiments are detailed in Section 4. Section 5 presents answer finding experiments. Finally, we conclude in Section 6.

2 Related Work

Subsequent work in this area often used similar kinds of training data such as question-answer pairs from Yahoo! Answers (Lee et al., 2008) or from the Wondir site (Xue et al., 2008). Lee et al. (2008) tried to further improve translation models based on question-answer pairs by selecting the most important terms to build compact translation models.

Other kinds of training data have also been proposed. Jeon et al. (2005) automatically clustered semantically similar questions based on their answers. Murdock and Croft (2005) created a first parallel corpus of synonym pairs extracted from WordNet, and an additional parallel corpus of English words translating to the same Arabic term in a parallel English-Arabic corpus.

Similar work has also been performed in the area of query expansion using training data consisting of FAQ pages (Riezler et al., 2007) or queries and clicked snippets from query logs (Riezler et al., 2008).

All in all, translation models have been shown to significantly improve the retrieval results over traditional baselines for document retrieval (Berger and Lafferty, 1999), question retrieval in Question & Answer archives (Jeon et al., 2005; Lee et al., 2008; Xue et al., 2008) and for sentence retrieval (Murdock and Croft, 2005).
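As an illustration of how a word translation table can enter a retrieval score, the sketch below contrasts a plain query likelihood baseline with a translation-based variant. The mixture form, the parameters `lam` and `beta`, the function name `score`, the toy documents and the probabilities in `trans` are all assumptions made for illustration; they are not the models or data evaluated in this paper.

```python
import math
from collections import Counter


def score(query, doc, collection, trans=None, lam=0.2, beta=0.5):
    """Return log P(query | doc) under a smoothed unigram model.

    Assumes `doc` and `collection` are non-empty token lists.
    If `trans` is given, it maps (query_word, doc_word) to a hypothetical
    translation probability P(query_word | doc_word), and the document model
    is a mixture of the maximum-likelihood and translation estimates.
    """
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    log_p = 0.0
    for w in query:
        p_ml = doc_tf[w] / len(doc)  # maximum-likelihood estimate
        if trans:
            # Probability of generating w by "translating" a document word.
            p_tr = sum(
                trans.get((w, t), 0.0) * n / len(doc) for t, n in doc_tf.items()
            )
            p_doc = (1 - beta) * p_ml + beta * p_tr
        else:
            p_doc = p_ml
        p_coll = coll_tf[w] / len(collection)  # collection smoothing
        # Floor avoids log(0) for words unseen in the whole collection.
        log_p += math.log(max((1 - lam) * p_doc + lam * p_coll, 1e-12))
    return log_p


# Hypothetical toy data: the query shares no word with doc1 or doc2.
query = ["repair", "bicycle"]
doc1 = ["fix", "a", "bike", "puncture"]
doc2 = ["bake", "a", "chocolate", "cake"]
doc3 = ["bicycle", "repair", "shop"]
collection = doc1 + doc2 + doc3
trans = {("repair", "fix"): 0.4, ("bicycle", "bike"): 0.5}

baseline = [score(query, d, collection) for d in (doc1, doc2)]
translated = [score(query, d, collection, trans) for d in (doc1, doc2)]
print(baseline, translated)
```

In this toy setting the baseline cannot distinguish `doc1` from `doc2`, since neither contains a query term, whereas the translation-based score ranks `doc1` higher because "fix" and "bike" are associated with the query words.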
Many of the approaches previously described