et al. (2005b) also discussed methods for grouping similar questions based on the similarity between answers in the archive. These grouped question pairs were further used as training data to estimate probabilities for a translation-based question retrieval model. Wang et al. (2009) proposed a tree kernel framework to find similar questions in the CQA archive based on syntactic tree structures. Wang et al. (2010) mined lexical and syntactic features to detect question sentences in CQA data.

3.1 TFIDF

Baeza-Yates and Ribeiro-Neto (1999) provide a TFIDF method to calculate the similarity between two texts. Each document is represented by a term vector of TFIDF scores. The similarity between two texts D_i and D_j is the cosine similarity in the vector space model:

cos(D_i, D_j) = (D_i^T D_j) / (||D_i|| ||D_j||)
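As a concrete illustration, TFIDF weighting and the cosine similarity between two term vectors can be sketched in plain Python. This is a minimal sketch: the whitespace tokenizer and the idf variant log(N/df) are assumptions for illustration, not the paper's exact setup.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TFIDF term vectors (dicts) for a small corpus.
    Weight of term t in a document: tf(t) * log(N / df(t))."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: count each term once per doc
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(di, dj):
    """cos(D_i, D_j) = (D_i^T D_j) / (||D_i|| ||D_j||) over sparse vectors."""
    dot = sum(w * dj.get(t, 0.0) for t, w in di.items())
    ni = math.sqrt(sum(w * w for w in di.values()))
    nj = math.sqrt(sum(w * w for w in dj.values()))
    return dot / (ni * nj) if ni and nj else 0.0
```

With this idf variant, terms occurring in every document get zero weight, so the score is driven by discriminative terms; a production system would typically use a smoothed idf and proper tokenization instead.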
This method is used in most information retrieval systems because it is both efficient and effective. However, if the query text contains only one or two words, the method is biased toward shorter answer texts (Jeon et al., 2005a). We also found that in CQA data, short contents in the question body cannot provide any information about the users' information needs. For these two reasons, we exclude from the test data sets any question whose information-need part contains only a few noninformative words.

topics discovered by Latent Dirichlet Allocation (LDA) methods.

In contrast to the TFIDF method, which measures “common words”, short texts are not compared to each other directly in probabilistic topic models. Instead, the texts are compared using some “third-party” topics that relate to them. A passage D in the retrieved documents (document collection) is represented as a mixture of fixed topics, with topic z getting weight θ_z^(D) in passage D, and each topic is a distribution over a finite vocabulary of words, with word w having probability φ_w^(z).
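The effect of comparing texts through “third-party” topics rather than shared words can be illustrated with a toy sketch. Everything here is an assumption for illustration: the two hand-made topics, the crude additive mixture inference, and the cosine comparison stand in for topic weights θ_z^(D) and word probabilities φ_w^(z) that a real system would estimate with LDA.

```python
import math

def topic_mixture(tokens, phi):
    """Crude stand-in for LDA inference: score each topic z by summing
    its word probabilities phi[z][w] over the passage, then normalize
    to a mixture theta^(D). Illustrative only, not proper inference."""
    scores = [sum(z_dist.get(w, 1e-9) for w in tokens) for z_dist in phi]
    total = sum(scores)
    return [s / total for s in scores]

def topic_cosine(theta_i, theta_j):
    """Cosine similarity between two topic mixtures."""
    dot = sum(a * b for a, b in zip(theta_i, theta_j))
    ni = math.sqrt(sum(a * a for a in theta_i))
    nj = math.sqrt(sum(b * b for b in theta_j))
    return dot / (ni * nj)

# Two hypothetical topics: z=0 "sports", z=1 "finance" (hand-made phi_w^(z)).
phi = [
    {"game": 0.4, "team": 0.3, "score": 0.3},
    {"stock": 0.4, "market": 0.3, "price": 0.3},
]
# The two short texts share no words, yet both load on the "sports" topic,
# so their topic mixtures are close even though their TFIDF cosine is zero.
t1 = topic_mixture(["game", "team"], phi)
t2 = topic_mixture(["score"], phi)
```

The point of the sketch: word-disjoint short texts, which TFIDF scores as completely dissimilar, can still come out highly similar once each is mapped to the shared topic space.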