
3.5 Combination of the Datasets

In order to investigate the role played by different kinds of training data, we combined the several translation models using the two methods described by Xue et al. (2008). The first method consists in a linear combination of the word-to-word translation probabilities after training:

P_Lin(w_i | w_j) = α P_WAQ(w_i | w_j) + γ P_WAQA(w_i | w_j) + δ P_LSR(w_i | w_j)    (1)

where α + γ + δ = 1. This approach will be labelled with the Lin subscript.
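As an illustration, the following Python sketch interpolates several word-to-word translation tables according to Equation (1). The table contents, the weights and the model names are hypothetical placeholders, not the values used in the experiments.

from collections import defaultdict

def interpolate(tables, weights):
    """Linearly combine word-to-word translation tables P(w_i | w_j).

    `tables` maps a model name to a dict {(w_i, w_j): probability};
    `weights` maps the same names to interpolation coefficients
    summing to 1 (the alpha, gamma, delta of Equation (1)).
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    combined = defaultdict(float)
    for name, table in tables.items():
        for pair, prob in table.items():
            combined[pair] += weights[name] * prob
    return dict(combined)

# Toy example with made-up probabilities for the pair (cat, feline).
tables = {
    "WAQ":  {("cat", "feline"): 0.20},
    "WAQA": {("cat", "feline"): 0.10},
    "LSR":  {("cat", "feline"): 0.40},
}
weights = {"WAQ": 0.4, "WAQA": 0.3, "LSR": 0.3}
p_lin = interpolate(tables, weights)
print(round(p_lin[("cat", "feline")], 2))  # 0.4*0.2 + 0.3*0.1 + 0.3*0.4 = 0.23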

The second method consists in pooling the training datasets, i.e. concatenating the parallel corpora before training. This approach will be labelled with the Pool subscript. Examples of word-to-word translations obtained with this type of combination can be found in the last column for each word in Table 1. The ALL_Pool setting corresponds to the pooling of all three parallel datasets: WAQ+WAQA+LSR.
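The pooling strategy itself amounts to concatenating the parallel corpora before a single translation model is trained on the result. A minimal sketch, assuming each corpus is represented as a list of (source, target) string pairs with invented contents:

def pool_corpora(*corpora):
    """Concatenate several parallel corpora into one training set.

    Each corpus is a list of (source, target) sentence pairs; the pooled
    list is what one translation model would then be trained on
    (e.g. the ALL_Pool setting pools WAQ, WAQA and LSR).
    """
    pooled = []
    for corpus in corpora:
        pooled.extend(corpus)
    return pooled

# Hypothetical toy corpora standing in for WAQ, WAQA and LSR.
waq  = [("how do plants grow", "plants grow from seeds")]
waqa = [("what is a mammal", "a mammal is a warm-blooded animal")]
lsr  = [("cat", "a small domesticated feline")]
all_pool = pool_corpora(waq, waqa, lsr)
print(len(all_pool))  # 3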

4 Semantic Relatedness Experiments

The aim of this first experiment is to perform an intrinsic evaluation of the word translation probabilities obtained, by comparing them to traditional semantic relatedness measures on the task of ranking word pairs. Human judgements of semantic relatedness can be used to evaluate how well semantic relatedness measures reflect human rankings, by correlating the two rankings with Spearman's rank correlation coefficient. Several evaluation datasets are available for English, but we restrict our study to the larger dataset created by Finkelstein et al. (2002), due to the low coverage of many word pairs from the other datasets in our translation tables. The Finkelstein dataset is divided into two subsets, containing 153 and 200 word pairs respectively, which we refer to as Fin1-153 and Fin2-200.
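As a concrete illustration of this evaluation protocol, the sketch below correlates scores produced by a relatedness measure with human judgements for the same word pairs, using scipy's implementation of Spearman's rank correlation. The word pairs and all scores are illustrative placeholders.

from scipy.stats import spearmanr

# Illustrative word pairs with human judgements (0-10 scale) and
# scores assigned by some relatedness measure or translation table.
pairs = [("tiger", "cat"), ("book", "paper"), ("king", "cabbage")]
human_scores = [7.35, 7.46, 0.23]
measure_scores = [0.61, 0.54, 0.02]

rho, p_value = spearmanr(human_scores, measure_scores)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")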

The translation probabilities are compared with concept vector based relatedness measures, which rely on the textual information available for each concept, i.e. glosses in WordNet, the full article or the first paragraph of the article in Wikipedia, or the full contents of a Wiktionary entry. We refer the reader to (Gabrilovich and Markovitch, 2007; Zesch et al., 2008) for technical details on how the concept vectors are built and used to obtain semantic relatedness values.
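To make the notion of a shared concept space concrete, the following sketch computes a cosine similarity between two words represented as sparse concept vectors, in the spirit of the measures cited above. The concept names and weights are invented for illustration and are not taken from the actual resources.

import math

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors
    (dicts mapping concept identifiers to weights)."""
    shared = set(u) & set(v)
    dot = sum(u[c] * v[c] for c in shared)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented concept vectors: two words receive a non-zero score
# as soon as they share at least one concept dimension.
cat   = {"Felidae": 0.9, "Pet": 0.7, "Mammal": 0.5}
tiger = {"Felidae": 0.8, "Mammal": 0.6, "Predator": 0.4}
print(round(cosine(cat, tiger), 2))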

Table 2 lists Spearman's rank correlation coefficients obtained for concept vector based measures and translation probabilities. In order to ensure a fair evaluation, we limit the comparison to the word pairs which are contained in all resources and translation tables.

Dataset                      Fin1-153   Fin2-200
Word pairs used              46         42
Concept vectors
  WordNet                    .26        .46
  Wikipedia                  .27        .03
  Wikipedia_First            .30        .38
  Wiktionary                 .39        .58
Translation probabilities
  WAQ                        .43        .65*
  WAQA                       .54*       .37
  LSR                        .51        .29
  ALL_Pool                   .52        .57

Table 2: Spearman's rank correlation coefficients on the Fin1-153 and Fin2-200 datasets. The best value for each dataset is marked with an asterisk. For Wikipedia_First, the concept vectors are based on the first paragraph of each article.

The first observation is that the coverage over the two evaluation datasets is rather small: only 46 pairs could be evaluated for the Fin1-153 dataset and 42 for the Fin2-200 dataset. This is mainly due to the natural absence of many word pairs from the translation tables: translation probabilities can only be obtained for word pairs observed in the parallel training data. Concept vector based measures are more flexible in that respect, since the relatedness value is computed from a common representation in a concept vector space. It is therefore possible to measure relatedness for a far greater number of word pairs, as long as they share some concept vector dimensions. The second observation is that, on the restricted subset of word pairs considered, the results obtained with word-to-word translation probabilities are better than those of the concept vector measures in most cases. However, the differences are not statistically significant.

5 Answer Finding Experiments

For the answer finding experiments, we used the questions and answers contained in the Microsoft Research Question Answering Corpus. This corpus comprises approximately 1.4K questions collected from 10-13 year old schoolchildren, who were asked "If you could talk to an encyclopedia, what would you ask it?". The answers to the questions have been manually identified in the full text of Encarta 98 and annotated with the following relevance judgements: exact answer (1), off topic (3), on topic - off target (4), partial answer (5). In order to use this dataset for an answer finding task, we consider the annotated answers as the documents to be retrieved and use the questions as the set of test queries.
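A minimal sketch of this conversion into a retrieval test collection, using invented identifiers and the relevance labels listed above: the annotated answers become the document collection, and each question becomes a query with graded relevance judgements.

# Hypothetical annotated records: (question_id, question, answer_id,
# answer_text, relevance_label), with labels as in the corpus annotation
# scheme (1 = exact answer, 3 = off topic, 4 = on topic - off target,
# 5 = partial answer).
records = [
    ("q1", "why is the sky blue", "a17", "Sunlight is scattered by air", 1),
    ("q1", "why is the sky blue", "a42", "The Nile is a river in Africa", 3),
    ("q2", "how do bees make honey", "a99", "Bees collect nectar from flowers", 5),
]

documents = {}   # answer_id -> answer text (the collection to be retrieved)
queries = {}     # question_id -> question text (the test queries)
qrels = {}       # question_id -> {answer_id: relevance label}

for qid, question, aid, answer, label in records:
    documents[aid] = answer
    queries[qid] = question
    qrels.setdefault(qid, {})[aid] = label

print(len(documents), len(queries), qrels["q1"])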

This corpus is particularly well suited for experiments targeting the lexical gap problem: only 28% of the question-answer pairs correspond to a strong match (two or more query terms