Monolingual translation probabilities have recently been introduced in retrieval models to solve the lexical gap problem. They can be obtained by training statistical translation models on parallel monolingual corpora, such as question-answer pairs, where answers act as the “source” language and questions as the “target” language. In this paper, we propose to use as a parallel training dataset the definitions and glosses provided for the same term by different lexical semantic resources. We compare monolingual translation models built from lexical semantic resources with two other kinds of datasets: manually-tagged question reformulations and question-answer pairs. We also show that the monolingual translation probabilities obtained (i) are comparable to traditional semantic relatedness measures and (ii) significantly improve the results over the query likelihood and the vector-space model for answer finding.

1 Introduction

The lexical gap (or lexical chasm) often observed between queries and documents or questions and answers is a pervasive problem both in Information Retrieval (IR) and Question Answering (QA). This problem arises from alternative ways of conveying the same information, due to synonymy or paraphrasing, and is especially severe for retrieval over shorter documents, such as sentence retrieval or question retrieval in Question & Answer archives. Several solutions to this problem have been proposed including query expansion (Riezler et al., 2007; Fang, 2008), query reformulation or paraphrasing (Hermjakob et al., 2002; Tomuro, 2003; Zukerman and Raskutti, 2002) [...] 2007).

Berger and Lafferty (1999) have formulated a further solution to the lexical gap problem, which consists in integrating monolingual statistical translation models in the retrieval process. Monolingual translation models encode statistical word associations which are trained on parallel monolingual corpora. The major drawback of this approach lies in the limited availability of truly parallel monolingual corpora. In practice, training data for translation-based retrieval often consist of question-answer pairs, usually extracted from the evaluation corpus itself (Riezler et al., 2007; Xue et al., 2008; Lee et al., 2008). While collection-specific translation models effectively encode statistical word associations for the target document collection, training on the evaluation corpus also introduces a bias in the evaluation and makes it difficult to assess the quality of the translation model per se, independently from a specific task and document collection.

In this paper, we propose new kinds of datasets for training domain-independent monolingual translation models. We use the definitions and glosses provided for the same term by different lexical semantic resources to automatically train the translation models. This approach has only very recently been made possible by the emergence of new kinds of lexical semantic and encyclopedic resources such as Wikipedia and Wiktionary. These resources are freely available and up-to-date, and they offer broad coverage and good quality. Thanks to the combination of several resources, it is possible to obtain monolingual parallel corpora which are large enough to train domain-independent translation models. In addition, we collected question-answer pairs and manually-tagged question reformulations from a social Q&A site. We use these datasets to build further translation models.

Translation-based retrieval models have been widely used in practice by the IR and QA community. However, the quality of the semantic information encoded in the translation tables has never been assessed intrinsically. To do so, we compare translation probabilities with concept vector based semantic relatedness measures with respect to human relatedness rankings for reference word pairs. This study provides empirical evidence for the high quality of the semantic information encoded in statistical word translation tables. We then use the translation models in an answer finding task based on a new question-answer dataset which is totally independent from the resources used for training the translation models. This extrinsic evaluation shows that our translation models significantly improve the results over the query likelihood and the vector-space model.

The remainder of the paper is organised as follows. Section 2 discusses related work on semantic relatedness and statistical translation models for retrieval. Section 3 presents the monolingual parallel datasets we used for obtaining monolingual translation probabilities. Semantic relatedness experiments are detailed in Section 4. Section 5 presents answer finding experiments. Finally, we conclude in Section 6.

2 Related Work

Subsequent work in this area often used similar kinds of training data such as question-answer pairs from Yahoo! Answers (Lee et al., 2008) or from the Wondir site (Xue et al., 2008). Lee et al. (2008) tried to further improve translation models based on question-answer pairs by selecting the most important terms to build compact translation models.

Other kinds of training data have also been proposed. Jeon et al. (2005) automatically clustered semantically similar questions based on their answers. Murdock and Croft (2005) created a first parallel corpus of synonym pairs extracted from WordNet, and an additional parallel corpus of English words translating to the same Arabic term in a parallel English-Arabic corpus.

Similar work has also been performed in the area of query expansion using training data consisting of FAQ pages (Riezler et al., 2007) or queries and clicked snippets from query logs (Riezler et al., 2008).

All in all, translation models have been shown to significantly improve the retrieval results over traditional baselines for document retrieval (Berger and Lafferty, 1999), question retrieval in Question & Answer archives (Jeon et al., 2005; Lee et al., 2008; Xue et al., 2008) and for sentence retrieval (Murdock and Croft, 2005).
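To make the role of a translation table in retrieval concrete, the following is a minimal sketch of a translation-based query likelihood score, in the spirit of the models cited above but not a reproduction of any of them. It assumes a simple linear interpolation with a collection model. The function name `translation_lm_score`, the mixing weight `lam`, and the layout of the translation table (a dictionary keyed by `(query_word, document_word)`) are illustrative assumptions.

```python
import math
from collections import Counter


def translation_lm_score(query, doc, trans, coll_counts, coll_len, lam=0.5):
    """Return log P(query | doc) under a simplified translation-based model.

    For each query word w:
        P(w | doc) = (1 - lam) * sum_t T(w | t) * P_ml(t | doc)
                     + lam * P_ml(w | collection)
    `trans` maps (w, t) to T(w | t); missing pairs count as 0.
    All names and the smoothing scheme are assumptions made for illustration.
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    log_p = 0.0
    for w in query:
        # Probability that some document word t "translates" into w.
        p_trans = sum(
            trans.get((w, t), 0.0) * count / doc_len
            for t, count in doc_counts.items()
        )
        # Background probability of w in the whole collection.
        p_coll = coll_counts.get(w, 0) / coll_len
        p = (1.0 - lam) * p_trans + lam * p_coll
        if p == 0.0:
            return float("-inf")  # the query word cannot be generated at all
        log_p += math.log(p)
    return log_p


# Toy data: "car" never occurs in the collection, but the table links it
# to "automobile", so the first document can still match the query.
trans = {
    ("automobile", "automobile"): 0.7, ("car", "automobile"): 0.3,
    ("repair", "repair"): 1.0, ("bicycle", "bicycle"): 1.0,
}
d1 = ["automobile", "repair"]
d2 = ["bicycle", "repair"]
coll_counts = Counter(d1 + d2)
coll_len = len(d1) + len(d2)
query = ["car", "repair"]

score_d1 = translation_lm_score(query, d1, trans, coll_counts, coll_len)
score_d2 = translation_lm_score(query, d2, trans, coll_counts, coll_len)
```

In this toy setting, a plain query likelihood model would give both documents zero probability for "car", whereas the translation term lets the first document bridge the lexical gap through "automobile".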

Many of the approaches previously described