3.2 Knowledge-based Measure

Mihalcea et al. (2006) proposed several knowledge-based methods for measuring the semantic level similarity of texts to solve the lexical chasm problem between short texts. These knowledge-based similarity measures were derived from word semantic similarity by making use of WordNet. The evaluation on a paraphrase recognition task showed that knowledge-based measures outperform the simpler lexical level approach.

We follow the definition in (Mihalcea et al., 2006) to derive a text-to-text similarity metric mcs for two given texts $D_i$ and $D_j$:

$$\mathrm{mcs}(D_i, D_j) = \frac{\sum_{w \in D_i} \mathrm{maxSim}(w, D_j) \cdot \mathrm{idf}(w)}{\sum_{w \in D_i} \mathrm{idf}(w)} + \frac{\sum_{w \in D_j} \mathrm{maxSim}(w, D_i) \cdot \mathrm{idf}(w)}{\cdots}$$

Sampling can be used to estimate the corresponding expected posterior probabilities $P(z|D) = \hat{\theta}^{(D)}_{z}$ and $P(w|z) = \hat{\phi}^{(z)}_{w}$ (Griffiths and Steyvers, 2004).

In this paper we use two LDA based similarity measures in (Celikyilmaz et al., 2010) to measure the similarity between short information need texts. The first LDA similarity method uses KL divergence to measure the similarity between two documents under each given topic:

$$\mathrm{sim}_{LDA1}(D_i, D_j) = \frac{1}{K} \sum_{k=1}^{K} 10^{W(D_i^{(z=k)},\, D_j^{(z=k)})}$$

$$W(D_i^{(z=k)}, D_j^{(z=k)}) = -\mathrm{KL}(D_i^{(z=k)} \,\|\, D_i^{(z=k)} + D_j^{(z=k)} \cdots$$