et al. (2005b) also discussed methods for grouping similar questions based on the similarity between answers in the archive. These grouped question pairs were further used as training data to estimate probabilities for a translation-based question retrieval model. Wang et al. (2009) proposed a tree kernel framework to find similar questions in the CQA archive based on syntactic tree structures. Wang et al. (2010) mined lexical and syntactic features to detect question sentences in CQA data.

3.1 TFIDF

Baeza-Yates and Ribeiro-Neto (1999) provide a TFIDF method to calculate the similarity between two texts. Each document is represented by a term vector of TFIDF scores. The similarity between two texts D_i and D_j is the cosine similarity in the vector space model:

cos(D_i, D_j) = (D_i^T D_j) / (||D_i|| ||D_j||)
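As a concrete illustration, TFIDF weighting and the cosine similarity between two term vectors can be sketched in plain Python. This is a minimal sketch: the whitespace tokenizer and the idf variant log(N/df) are assumptions for illustration, not the paper's exact setup.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TFIDF term vectors (dicts) for a small corpus.
    Weight of term t in a document: tf(t) * log(N / df(t))."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: count each term once per doc
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(di, dj):
    """cos(D_i, D_j) = (D_i^T D_j) / (||D_i|| ||D_j||) over sparse vectors."""
    dot = sum(w * dj.get(t, 0.0) for t, w in di.items())
    ni = math.sqrt(sum(w * w for w in di.values()))
    nj = math.sqrt(sum(w * w for w in dj.values()))
    return dot / (ni * nj) if ni and nj else 0.0
```

With this idf variant, terms occurring in every document get zero weight, so the score is driven by discriminative terms; a production system would typically use a smoothed idf and proper tokenization instead.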
This method is used in most information retrieval systems because it is both efficient and effective. However, if the query text contains only one or two words, the method is biased toward shorter answer texts (Jeon et al., 2005a). We also found that in CQA data, short contents in the question body cannot provide any information about the users' information needs. For these two reasons, we exclude from the test data sets any question whose information-need part contains only a few noninformative words.

topics discovered by Latent Dirichlet Allocation (LDA) methods.

In contrast to the TFIDF method, which measures “common words”, short texts are not compared to each other directly in probabilistic topic models. Instead, the texts are compared using some “third-party” topics that relate to them. A passage D in the retrieved documents (document collection) is represented as a mixture of fixed topics, with topic z getting weight θ_z^(D) in passage D, and each topic is a distribution over a finite vocabulary of words, with word w having probability φ_w^(z).
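The effect of comparing texts through “third-party” topics rather than shared words can be illustrated with a toy sketch. Everything here is an assumption for illustration: the two hand-made topics, the crude additive mixture inference, and the cosine comparison stand in for topic weights θ_z^(D) and word probabilities φ_w^(z) that a real system would estimate with LDA.

```python
import math

def topic_mixture(tokens, phi):
    """Crude stand-in for LDA inference: score each topic z by summing
    its word probabilities phi[z][w] over the passage, then normalize
    to a mixture theta^(D). Illustrative only, not proper inference."""
    scores = [sum(z_dist.get(w, 1e-9) for w in tokens) for z_dist in phi]
    total = sum(scores)
    return [s / total for s in scores]

def topic_cosine(theta_i, theta_j):
    """Cosine similarity between two topic mixtures."""
    dot = sum(a * b for a, b in zip(theta_i, theta_j))
    ni = math.sqrt(sum(a * a for a in theta_i))
    nj = math.sqrt(sum(b * b for b in theta_j))
    return dot / (ni * nj)

# Two hypothetical topics: z=0 "sports", z=1 "finance" (hand-made phi_w^(z)).
phi = [
    {"game": 0.4, "team": 0.3, "score": 0.3},
    {"stock": 0.4, "market": 0.3, "price": 0.3},
]
# The two short texts share no words, yet both load on the "sports" topic,
# so their topic mixtures are close even though their TFIDF cosine is zero.
t1 = topic_mixture(["game", "team"], phi)
t2 = topic_mixture(["score"], phi)
```

The point of the sketch: word-disjoint short texts, which TFIDF scores as completely dissimilar, can still come out highly similar once each is mapped to the shared topic space.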