3.5 Combination of the Datasets

In order to investigate the role played by different kinds of training data, we combined the several translation models, using the two methods described by Xue et al. (2008). The first method consists in a linear combination of the word-to-word translation probabilities after training:

  P_Lin(w_i|w_j) = α P_WAQ(w_i|w_j) + γ P_WAQA(w_i|w_j) + δ P_LSR(w_i|w_j)    (1)

where α + γ + δ = 1. This approach will be labelled with the Lin subscript.

The second method consists in pooling the training datasets, i.e. concatenating the parallel corpora, before training. This approach will be labelled with the Pool subscript. Examples for word-to-word translations obtained with this type of combination can be found in the last column for each word in Table 1. The ALL_Pool setting corresponds to the pooling of all three parallel datasets: WAQ+WAQA+LSR.

4 Semantic Relatedness Experiments

The aim of this first experiment is to perform an intrinsic evaluation of the word translation probabilities obtained by comparing them to traditional semantic relatedness measures on the task of ranking word pairs. Human judgements of semantic relatedness can be used to evaluate how well semantic relatedness measures reflect human rankings by correlating their ranking results with Spearman's rank correlation coefficient. Several evaluation datasets are available for English, but we restrict our study to the larger dataset created by Finkelstein et al. (2002) due to the low coverage of many

available for each concept, i.e. glosses in WordNet, the full article or the first paragraph of the article in Wikipedia, or the full contents of a Wiktionary entry. We refer the reader to (Gabrilovich and Markovitch, 2007; Zesch et al., 2008) for technical details on how the concept vectors are built and used to obtain semantic relatedness values.

Table 2 lists Spearman's rank correlation coefficients obtained for concept vector based measures and translation probabilities. In order to ensure a fair evaluation, we limit the comparison to the word pairs which are contained in all resources and translation tables.

  Dataset                    Fin1-153  Fin2-200
  Word pairs used              46        42
  Concept vectors
    WordNet                    .26       .46
    Wikipedia                  .27       .03
    Wikipedia_First            .30       .38
    Wiktionary                 .39       .58
  Translation probabilities
    WAQ                        .43       .65
    WAQA                       .54       .37
    LSR                        .51       .29
    ALL_Pool                   .52       .57

Table 2: Spearman's rank correlation coefficients on the Fin1-153 and Fin2-200 datasets. Best values for each dataset are in bold. For Wikipedia_First, the concept vectors are based on the first paragraph of each article.

The first observation is that the coverage over the two evaluation datasets is rather small: only 46 pairs have been evaluated for the Fin1-153 dataset and 42 for the Fin2-200 dataset. This is mainly due to the natural absence of many word pairs in the translation tables. Indeed, translation probabilities can only be obtained from observed parallel pairs in the training data. Concept vector based measures are more flexible in that respect, since the relatedness value is based on a common representation in a concept vector space. It is therefore possible to measure relatedness for a far greater number of word pairs, as long as they share some concept vector dimensions. The second observation is that, on the restricted subset of word pairs considered, the results obtained by word-to-word translation probabilities are most of the time better than those of concept vector measures. However, the differences are not statistically significant.

we used the questions and answers contained in the Microsoft Research Question Answering Corpus.7 This corpus comprises approximately 1.4K questions collected from 10-13 year old schoolchildren, who were asked "If you could talk to an encyclopedia, what would you ask it?". The answers to the questions have been manually identified in the full text of Encarta 98 and annotated with the following relevance judgements: exact answer (1), off topic (3), on topic - off target (4), partial answer (5). In order to use this dataset for an answer finding task, we consider the annotated answers as the documents to be retrieved and use the questions as the set of test queries. This corpus is particularly well suited to con-
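The recasting of the annotated corpus into an answer-finding test collection described above can be sketched as follows. The record layout and the example annotations are invented for illustration, and treating only exact answers (judgement code 1) as relevant by default is an assumption, not a detail from the corpus:

```python
# Sketch: annotated answers become the document collection, questions
# become the queries, and relevance judgements become the qrels.
# The triples below are invented examples, not real corpus data.
ANNOTATIONS = [
    ("q1", "The moon orbits the Earth in about 27 days.", 1),  # exact answer
    ("q1", "The Earth has one natural satellite.", 5),         # partial answer
    ("q1", "Mars has two moons.", 3),                          # off topic
    ("q2", "Volcanoes erupt when magma reaches the surface.", 1),
]

def build_collection(annotations, relevant_codes=(1,)):
    """Return (documents, qrels): every annotated answer is a document,
    and qrels maps each query id to the set of relevant document ids."""
    documents = {}
    qrels = {}
    for doc_id, (qid, text, judgement) in enumerate(annotations):
        documents[doc_id] = text
        if judgement in relevant_codes:
            qrels.setdefault(qid, set()).add(doc_id)
    return documents, qrels

docs, qrels = build_collection(ANNOTATIONS)
print(qrels)  # {'q1': {0}, 'q2': {3}}
```

Widening `relevant_codes` to `(1, 5)` would additionally count partial answers as relevant; which codes to treat as relevant is an evaluation design choice.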
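Spearman's rank correlation, used for all coefficients in Table 2, amounts to ranking both score lists and taking the Pearson correlation of the ranks. A minimal self-contained sketch (the word-pair scores below are invented, not the Fin1-153 data):

```python
def average_ranks(values):
    """Rank values (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical human judgements vs. model scores for four word pairs.
human = [9.8, 7.4, 3.1, 1.2]
model = [0.80, 0.55, 0.60, 0.05]
print(round(spearman(human, model), 2))  # prints 0.8
```

Because only the rankings matter, any monotonic rescaling of the relatedness scores leaves the coefficient unchanged, which is what makes Spearman's rho suitable for comparing heterogeneous measures.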