3.5 Combination of the Datasets

In order to investigate the role played by different kinds of training data, we combined the several translation models, using the two methods described by Xue et al. (2008). The first method consists in a linear combination of the word-to-word translation probabilities after training:

  P_Lin(w_i|w_j) = α P_WAQ(w_i|w_j) + γ P_WAQA(w_i|w_j) + δ P_LSR(w_i|w_j)    (1)

where α + γ + δ = 1. This approach will be labelled with the Lin subscript.

The second method consists in pooling the training datasets, i.e. concatenating the parallel corpora, before training. This approach will be labelled with the Pool subscript. Examples for word-to-word translations obtained with this type of combination can be found in the last column for each word in Table 1. The ALL_Pool setting corresponds to the pooling of all three parallel datasets: WAQ+WAQA+LSR.

4 Semantic Relatedness Experiments

The aim of this first experiment is to perform an intrinsic evaluation of the word translation probabilities obtained by comparing them to traditional semantic relatedness measures on the task of ranking word pairs. Human judgements of semantic relatedness can be used to evaluate how well semantic relatedness measures reflect human rankings by correlating their ranking results with Spearman's rank correlation coefficient. Several evaluation datasets are available for English, but we restrict our study to the larger dataset created by Finkelstein et al. (2002) due to the low coverage of many

available for each concept, i.e. glosses in WordNet, the full article or the first paragraph of the article in Wikipedia, or the full contents of a Wiktionary entry. We refer the reader to (Gabrilovich and Markovitch, 2007; Zesch et al., 2008) for technical details on how the concept vectors are built and used to obtain semantic relatedness values.

Table 2 lists Spearman's rank correlation coefficients obtained for concept vector based measures and translation probabilities. In order to ensure a fair evaluation, we limit the comparison to the word pairs which are contained in all resources and translation tables.

  Dataset                    Fin1-153  Fin2-200
  Word pairs used              46        42
  Concept vectors
    WordNet                    .26       .46
    Wikipedia                  .27       .03
    Wikipedia_First            .30       .38
    Wiktionary                 .39       .58
  Translation probabilities
    WAQ                        .43       .65
    WAQA                       .54       .37
    LSR                        .51       .29
    ALL_Pool                   .52       .57

Table 2: Spearman's rank correlation coefficients on the Fin1-153 and Fin2-200 datasets. Best values for each dataset are in bold. For Wikipedia_First, the concept vectors are based on the first paragraph of each article.

The first observation is that the coverage over the two evaluation datasets is rather small: only 46 pairs have been evaluated for the Fin1-153 dataset and 42 for the Fin2-200 dataset. This is mainly due to the natural absence of many word pairs in the translation tables. Indeed, translation probabilities can only be obtained from observed parallel pairs in the training data. Concept vector based measures are more flexible in that respect, since the relatedness value is based on a common representation in a concept vector space. It is therefore possible to measure relatedness for a far greater number of word pairs, as long as they share some concept vector dimensions. The second observation is that, on the restricted subset of word pairs considered, the results obtained by word-to-word translation probabilities are most of the time better than those of concept vector measures. However, the differences are not statistically significant.

we used the questions and answers contained in the Microsoft Research Question Answering Corpus.7 This corpus comprises approximately 1.4K questions collected from 10-13 year old schoolchildren, who were asked "If you could talk to an encyclopedia, what would you ask it?". The answers to the questions have been manually identified in the full text of Encarta 98 and annotated with the following relevance judgements: exact answer (1), off topic (3), on topic - off target (4), partial answer (5). In order to use this dataset for an answer finding task, we consider the annotated answers as the documents to be retrieved and use the questions as the set of test queries. This corpus is particularly well suited to con-
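The recasting of the annotated corpus into an answer-finding test collection described above can be sketched as follows. The record layout and the example annotations are invented for illustration, and treating only exact answers (judgement code 1) as relevant by default is an assumption, not a detail from the corpus:

```python
# Sketch: annotated answers become the document collection, questions
# become the queries, and relevance judgements become the qrels.
# The triples below are invented examples, not real corpus data.
ANNOTATIONS = [
    ("q1", "The moon orbits the Earth in about 27 days.", 1),  # exact answer
    ("q1", "The Earth has one natural satellite.", 5),         # partial answer
    ("q1", "Mars has two moons.", 3),                          # off topic
    ("q2", "Volcanoes erupt when magma reaches the surface.", 1),
]

def build_collection(annotations, relevant_codes=(1,)):
    """Return (documents, qrels): every annotated answer is a document,
    and qrels maps each query id to the set of relevant document ids."""
    documents = {}
    qrels = {}
    for doc_id, (qid, text, judgement) in enumerate(annotations):
        documents[doc_id] = text
        if judgement in relevant_codes:
            qrels.setdefault(qid, set()).add(doc_id)
    return documents, qrels

docs, qrels = build_collection(ANNOTATIONS)
print(qrels)  # {'q1': {0}, 'q2': {3}}
```

Widening `relevant_codes` to `(1, 5)` would additionally count partial answers as relevant; which codes to treat as relevant is an evaluation design choice.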
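Spearman's rank correlation, used for all coefficients in Table 2, amounts to ranking both score lists and taking the Pearson correlation of the ranks. A minimal self-contained sketch (the word-pair scores below are invented, not the Fin1-153 data):

```python
def average_ranks(values):
    """Rank values (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical human judgements vs. model scores for four word pairs.
human = [9.8, 7.4, 3.1, 1.2]
model = [0.80, 0.55, 0.60, 0.05]
print(round(spearman(human, model), 2))  # prints 0.8
```

Because only the rankings matter, any monotonic rescaling of the relatedness scores leaves the coefficient unchanged, which is what makes Spearman's rho suitable for comparing heterogeneous measures.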