
The questions in our data set come from two Yahoo! answers categories: 'travel' (1 million questions) and 'computers&internet' (1 million). Depending on whether the best answers have been chosen by the asker, questions from Yahoo! answers can be divided into 'resolved' and 'unresolved' categories. From each of the two topic categories, we randomly selected 200 resolved questions to construct two testing data sets: 'Test t' ('travel') and 'Test c' ('computers&internet'). In order to measure information need similarity in our experiments, we selected only those questions whose information need part contained at least 3 informative words after stop word removal. The remaining questions under the two categories, 'Train t' and 'Train c', are left for estimating the LDA topic models and the translation models. We will show how we obtain these models later.
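The selection criterion just described (at least 3 informative words in the information need part after stop word removal) reduces to a short check. The following sketch assumes a generic regular-expression tokenizer and a small illustrative stop list; the paper does not specify the exact resources used.

import re

# Illustrative stop list; a stand-in for whatever list the authors used.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "i", "my",
              "in", "for", "it", "this", "that", "do", "you", "what", "how"}

def informative_words(text):
    # Lowercase, tokenize on word characters, and drop stop words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def keep_for_test(info_need_text, min_words=3):
    # The selection criterion: at least 3 informative words remain.
    return len(informative_words(info_need_text)) >= min_words

# 'need', 'cheap', 'hotel', 'near', 'airport', 'rome' survive the stop list.
print(keep_for_test("I need a cheap hotel near the airport in Rome"))  # True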

              Test c                              Test t
Methods       MRR    Precision@5  Precision@10   MRR    Precision@5  Precision@10
TFIDF         84.2%  67.1%        61.9%          92.8%  74.8%        63.3%
Knowledge1    82.2%  65.0%        65.6%          78.1%  67.0%        69.6%
Knowledge2    76.7%  54.9%        59.3%          61.6%  53.3%        58.2%
LDA1          92.5%  68.8%        64.7%          91.8%  75.4%        69.8%
LDA2          61.5%  55.3%        60.2%          52.1%  57.4%        54.5%

Table 2: Question recommendation results without information need prediction

              Test c                              Test t
Methods       MRR    Precision@5  Precision@10   MRR    Precision@5  Precision@10
TFIDF         86.2%  70.8%        64.3%          95.1%  77.8%        69.3%
Knowledge1    82.2%  65.0%        66.6%          76.7%  68.0%        68.7%
Knowledge2    76.7%  54.9%        60.2%          61.6%  53.3%        58.2%
LDA1          95.8%  72.4%        68.2%          96.2%  79.5%        69.2%
LDA2          61.5%  55.3%        58.9%          68.1%  58.3%        53.9%

Table 3: Question recommendation results with information need predicted by the translation model
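For reference, the MRR and Precision@k figures reported in Tables 2 and 3 can be computed as below. The sketch assumes each test question comes with a ranked recommendation list and a set of relevance judgments; the variable names are illustrative, not taken from the paper.

def mrr(ranked_lists, relevant_sets):
    # Mean reciprocal rank of the first relevant recommendation per query.
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranking, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def precision_at_k(ranked_lists, relevant_sets, k):
    # Fraction of the top-k recommendations that are relevant, averaged.
    scores = []
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        top_k = ranking[:k]
        scores.append(sum(1 for item in top_k if item in relevant) / k)
    return sum(scores) / len(scores)

rankings = [["q3", "q1", "q7"], ["q2", "q9", "q4"]]
relevant = [{"q1"}, {"q2", "q4"}]
print(mrr(rankings, relevant))                 # (1/2 + 1/1) / 2 = 0.75
print(precision_at_k(rankings, relevant, 3))   # (1/3 + 2/3) / 2 = 0.5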

5.1 Text Preprocessing

The questions posted on community QA sites often contain spelling or grammar errors. These errors influence the calculation of similarity and the performance of information retrieval (Zhao et al., 2007; Bunescu and Huang, 2010). In this paper, we use the open-source software afterthedeadline to automatically correct the spelling errors in the question and information need texts first. We also make use of the Web 1T 5-gram corpus to implement an N-gram based method (Cheng et al., 2008) that further filters out false positive corrections and re-ranks the correction suggestions (Mudge, 2010).
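The N-gram filtering step can be pictured as follows: each candidate correction is scored by the frequency of its surrounding n-gram in a large corpus, and suggestions that do not outscore the original spelling are discarded as likely false positives. This is only a sketch of the general idea, with a toy trigram table standing in for the Web 1T 5-gram data; it is not the exact method of Cheng et al. (2008) or Mudge (2010).

# Toy trigram counts; a real system would query a Web 1T 5-gram index.
TOY_COUNTS = {
    ("a", "hotel", "tonight"): 5000,
    ("a", "hostel", "tonight"): 800,
}

def ngram_count(ngram):
    # Stand-in for a corpus n-gram frequency lookup.
    return TOY_COUNTS.get(ngram, 0)

def rerank_suggestions(prev_word, word, next_word, suggestions):
    # Order candidates by trigram frequency in context, and drop any that
    # do not outscore the original word (likely false-positive corrections).
    baseline = ngram_count((prev_word, word, next_word))
    scored = sorted(
        ((ngram_count((prev_word, c, next_word)), c) for c in suggestions),
        reverse=True,
    )
    return [c for score, c in scored if score > baseline]

# For "book a hotol tonight", 'hotel' is ranked above 'hostel'.
print(rerank_suggestions("a", "hotol", "tonight", ["hostel", "hotel"]))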

The texts are tagged by Brill's Part-of-Speech Tagger, as the rule-based tagger is more robust than state-of-the-art statistical taggers on raw web content. This tagging information is only used for the WordNet similarity calculation.
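To illustrate how the POS tags enter the WordNet similarity calculation, the sketch below restricts the synset lookup by part of speech and takes the best path similarity between the two words. The use of NLTK, and of path similarity in particular, is an assumption here; the paper does not name its WordNet toolkit or similarity measure.

from nltk.corpus import wordnet as wn

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag prefix to a WordNet POS constant.
    if tag.startswith("N"):
        return wn.NOUN
    if tag.startswith("V"):
        return wn.VERB
    if tag.startswith("J"):
        return wn.ADJ
    if tag.startswith("R"):
        return wn.ADV
    return None

def word_similarity(w1, tag1, w2, tag2):
    # Best path similarity over the synsets allowed by each word's POS tag.
    synsets1 = wn.synsets(w1, pos=penn_to_wordnet(tag1))
    synsets2 = wn.synsets(w2, pos=penn_to_wordnet(tag2))
    scores = [a.path_similarity(b) for a in synsets1 for b in synsets2]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

# Restricting both words to noun senses; prints a value in (0, 1].
print(word_similarity("hotel", "NN", "hostel", "NN"))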

Stop word removal and lemmatization are applied to all the raw texts before they are fed into machine translation model training and LDA model estimation.
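A minimal version of this normalization step, assuming NLTK's English stop list and WordNet lemmatizer as stand-ins for the authors' actual resources:

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

_STOP = set(stopwords.words("english"))
_LEMMATIZER = WordNetLemmatizer()

def normalize(text):
    # Tokenize, drop stop words and punctuation, lemmatize what remains.
    tokens = word_tokenize(text.lower())
    return [_LEMMATIZER.lemmatize(t) for t in tokens
            if t.isalpha() and t not in _STOP]

print(normalize("Which hotels are the cheapest ones near Rome?"))
# ['hotel', 'cheapest', 'one', 'near', 'rome']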

Different alignment models such as IBM-1 to IBM-5 (Brown et al., 1993) and the HMM model (Och and Ney, 2000) provide different decompositions of