
et al. (2005b) also discussed methods for grouping similar questions based on the similarity between answers in the archive. These grouped question pairs were further used as training data to estimate probabilities for a translation-based question retrieval model. Wang et al. (2009) proposed a tree kernel framework to find similar questions in the CQA archive based on syntactic tree structures. Wang et al. (2010) mined lexical and syntactic features to detect question sentences in CQA data.

3.1 TFIDF

Baeza-Yates and Ribeiro-Neto (1999) provide a TFIDF method to calculate the similarity between two texts. Each document is represented by a term vector of TFIDF scores. The similarity between two texts $D_i$ and $D_j$ is the cosine similarity in the vector space model:

$$\cos(D_i, D_j) = \frac{D_i^{T} D_j}{\|D_i\|\,\|D_j\|}$$
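The cosine similarity above can be sketched in a few lines; this is a minimal stdlib-only Python illustration (with a simple tf * log(N/df) weighting), not the implementation used in the paper:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TFIDF term vectors for a small corpus of token lists."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # weight = term frequency * inverse document frequency
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

For example, two questions sharing the words "install python" score higher against each other than against an unrelated question with no overlapping terms, which scores exactly zero.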

This method is used in most information retrieval

systems as it is both efficient and effective. However, if the query text contains only one or two words, this method is biased toward shorter answer texts (Jeon et al., 2005a). We also found that in CQA data, short content in the question body cannot provide any information about the users' information needs. For these two reasons, we do not include in the test data sets the questions whose information-need parts contain only a few noninformative words.

topics discovered by Latent Dirichlet Allocation (LDA) methods. In contrast to the TFIDF method, which measures "common words", short texts are not compared to each other directly in probabilistic topic models. Instead, the texts are compared through "third-party" topics that relate to them. A passage D in the retrieved documents (document collection) is represented as a mixture of fixed topics, with topic z receiving weight $\theta_z^{(D)}$ in passage D, and each topic is a distribution over a finite vocabulary of words, with word w having probability $\phi_w^{(z)}$ in topic z. Gibbs
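The topic-mixture representation just described can be sketched as follows. The topic names and all probabilities below are made-up toy values for illustration, not parameters estimated by Gibbs sampling or taken from the paper:

```python
# Hypothetical per-passage topic weights theta^{(D)} ...
theta = {
    "D1": {"sports": 0.8, "cooking": 0.2},
    "D2": {"sports": 0.7, "cooking": 0.3},
}
# ... and per-topic word distributions phi^{(z)} over a tiny vocabulary.
phi = {
    "sports":  {"game": 0.5, "team": 0.4, "oven": 0.1},
    "cooking": {"game": 0.1, "team": 0.1, "oven": 0.8},
}

def word_prob(doc, word):
    """P(w | D) = sum over topics z of theta_z^{(D)} * phi_w^{(z)}."""
    return sum(theta[doc][z] * phi[z].get(word, 0.0) for z in theta[doc])
```

The point of the representation is visible even in this toy: two passages that share no words can still look similar because their topic mixtures theta^{(D)} are close.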