

Monolingual translation probabilities have recently been introduced in retrieval models to solve the lexical gap problem. They can be obtained by training statistical translation models on parallel monolingual corpora, such as question-answer pairs, where answers act as the “source” language and questions as the “target” language. In this paper, we propose to use as a parallel training dataset the definitions and glosses provided for the same term by different lexical semantic resources. We compare monolingual translation models built from lexical semantic resources with two other kinds of datasets: manually-tagged question reformulations and question-answer pairs. We also show that the monolingual translation probabilities obtained (i) are comparable to traditional semantic relatedness measures and (ii) significantly improve the results over the query likelihood and the vector-space model for answer finding.

1 Introduction

The lexical gap (or lexical chasm) often observed between queries and documents or questions and answers is a pervasive problem both in Information Retrieval (IR) and Question Answering (QA). This problem arises from alternative ways of conveying the same information, due to synonymy or paraphrasing, and is especially severe for retrieval over shorter documents, such as sentence retrieval or question retrieval in Question & Answer archives. Several solutions to this problem have been proposed including query expansion (Riezler et al., 2007; Fang, 2008), query reformulation or paraphrasing (Hermjakob et al., 2002; Tomuro, 2003; Zukerman and Raskutti, 2002) [...] 2007).

Berger and Lafferty (1999) have formulated a further solution to the lexical gap problem consisting in integrating monolingual statistical translation models in the retrieval process. Monolingual translation models encode statistical word associations which are trained on parallel monolingual corpora. The major drawback of this approach lies in the limited availability of truly parallel monolingual corpora. In practice, training data for translation-based retrieval often consist in question-answer pairs, usually extracted from the evaluation corpus itself (Riezler et al., 2007; Xue et al., 2008; Lee et al., 2008). While collection-specific translation models effectively encode statistical word associations for the target document collection, it also introduces a bias in the evaluation and makes it difficult to assess the quality of the translation model per se, independently from a specific task and document collection.

In this paper, we propose new kinds of datasets for training domain-independent monolingual translation models. We use the definitions and glosses provided for the same term by different lexical semantic resources to automatically train the translation models. This approach has been very recently made possible by the emergence of new kinds of lexical semantic and encyclopedic resources such as Wikipedia and Wiktionary. These resources are freely available, up-to-date and have a broad coverage and good quality. Thanks to the combination of several resources, it is possible to obtain monolingual parallel corpora which are large enough to train domain-independent translation models. In addition, we collected question-answer pairs and manually-tagged question reformulations from a social Q&A site. We use these datasets to build further translation models.

Translation-based retrieval models have been widely used in practice by the IR and QA community. However, the quality of the semantic information encoded in the translation tables has never been assessed intrinsically. To do so, we compare translation probabilities with concept vector based semantic relatedness measures with respect to human relatedness rankings for reference word pairs. This study provides empirical evidence for the high quality of the semantic information encoded in statistical word translation tables. We then use the translation models in an answer finding task based on a new question-answer dataset which is totally independent from the resources used for training the translation models. This extrinsic evaluation shows that our translation models significantly improve the results over the query likelihood and the vector-space model.

The remainder of the paper is organised as follows. Section 2 discusses related work on semantic relatedness and statistical translation models for retrieval. Section 3 presents the monolingual parallel datasets we used for obtaining monolingual translation probabilities. Semantic relatedness experiments are detailed in Section 4. Section 5 presents answer finding experiments. Finally, we conclude in Section 6.

2 Related Work

Subsequent work in this area often used similar kinds of training data such as question-answer pairs from Yahoo! Answers (Lee et al., 2008) or from the Wondir site (Xue et al., 2008). Lee et al. (2008) tried to further improve translation models based on question-answer pairs by selecting the most important terms to build compact translation models.

Other kinds of training data have also been proposed. Jeon et al. (2005) automatically clustered semantically similar questions based on their answers. Murdock and Croft (2005) created a first parallel corpus of synonym pairs extracted from WordNet, and an additional parallel corpus of English words translating to the same Arabic term in a parallel English-Arabic corpus.

Similar work has also been performed in the area of query expansion using training data consisting of FAQ pages (Riezler et al., 2007) or queries and clicked snippets from query logs (Riezler et al., 2008).

All in all, translation models have been shown to significantly improve the retrieval results over traditional baselines for document retrieval (Berger and Lafferty, 1999), question retrieval in Question & Answer archives (Jeon et al., 2005; Lee et al., 2008; Xue et al., 2008) and for sentence retrieval (Murdock and Croft, 2005).

Many of the approaches previously described