Sampling can be used to estimate the corresponding expected posterior probabilities $P(z|D) = \hat{\theta}_z^{(D)}$ and $P(w|z) = \hat{\phi}_w^{(z)}$ (Griffiths and Steyvers, 2004).

In this paper we use two LDA based similarity measures in (Celikyilmaz et al., 2010) to measure the similarity between short information need texts. The first LDA similarity method uses KL divergence to measure the similarity between two documents under each given topic:

$$\mathrm{sim}_{LDA1}(D_i, D_j) = \frac{1}{K} \sum_{k=1}^{K} 10^{W(D_i^{(z=k)},\, D_j^{(z=k)})}$$

$$W(D_i^{(z=k)}, D_j^{(z=k)}) = -\mathrm{KL}\big(D_i^{(z=k)} \,\|\, D_i^{(z=k)} + D_j^{(z=k)} \;\ldots$$

3.2 Knowledge-based Measure

Mihalcea et al. (2006) proposed several knowledge-based methods for measuring the semantic level similarity of texts to solve the lexical chasm problem between short texts. These knowledge-based similarity measures were derived from word semantic similarity by making use of WordNet. The evaluation on a paraphrase recognition task showed that knowledge-based measures outperform the simpler lexical level approach.

We follow the definition in (Mihalcea et al., 2006) to derive a text-to-text similarity metric mcs for two given texts $D_i$ and $D_j$:

$$\mathrm{mcs}(D_i, D_j) = \frac{\sum_{w \in D_i} \mathrm{maxSim}(w, D_j) \ast \mathrm{idf}(w)}{\sum_{w \in D_i} \mathrm{idf}(w)} + \frac{\sum_{w \in D_j} \mathrm{maxSim}(w, D_i) \ast \mathrm{idf}(w)}{\ldots}$$
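To make the mcs metric concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that the second term is normalized by the idf mass of $D_j$, mirroring the first term, and that the two directional terms are simply summed. The word-level similarity is passed in as a function; the toy `toy_word_sim` table and the idf values are hypothetical stand-ins for a WordNet-based measure and corpus statistics.

```python
from typing import Callable, Dict, Sequence

WordSim = Callable[[str, str], float]


def max_sim(word: str, other: Sequence[str], word_sim: WordSim) -> float:
    """Highest word-to-word similarity between `word` and any word in `other`."""
    return max((word_sim(word, v) for v in other), default=0.0)


def directional_score(src: Sequence[str], tgt: Sequence[str],
                      idf: Dict[str, float], word_sim: WordSim) -> float:
    """idf-weighted average of maxSim(w, tgt) over the words w in src."""
    total_idf = sum(idf.get(w, 0.0) for w in src)
    if total_idf == 0.0:
        return 0.0  # no informative words in src
    weighted = sum(max_sim(w, tgt, word_sim) * idf.get(w, 0.0) for w in src)
    return weighted / total_idf


def mcs(d_i: Sequence[str], d_j: Sequence[str],
        idf: Dict[str, float], word_sim: WordSim) -> float:
    """Sum of the two directional scores.

    Assumption: the D_j term is normalized by the idf mass of D_j.
    """
    return (directional_score(d_i, d_j, idf, word_sim)
            + directional_score(d_j, d_i, idf, word_sim))


# Hypothetical toy data: a stand-in for a WordNet-based word similarity.
_TOY_TABLE = {frozenset(("cheap", "inexpensive")): 0.8}


def toy_word_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return _TOY_TABLE.get(frozenset((a, b)), 0.0)


toy_idf = {"cheap": 2.0, "inexpensive": 2.0, "flight": 1.0}
score = mcs(["cheap", "flight"], ["inexpensive", "flight"], toy_idf, toy_word_sim)
```

Passing the word similarity in as a function keeps the sketch independent of any particular WordNet measure, so a different word-level similarity can be swapped in without changing the aggregation.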