
The questions in our data set come from two Yahoo! answers categories: 'travel' (1 million questions) and 'computers&internet' (1 million). Depending on whether the best answers have been chosen by the asker, questions from Yahoo! answers can be divided into 'resolved' and 'unresolved' categories. From each of the two topic categories, we randomly selected 200 resolved questions to construct two testing data sets: 'Test t' ('travel') and 'Test c' ('computers&internet'). In order to measure information need similarity in our experiments, we selected only those questions whose information need part contained at least 3 informative words after stop word removal. The remaining questions under the two categories, 'Train t' and 'Train c', are left for estimating the LDA topic models and the translation models. We will show how we obtain these models later.
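The selection criterion just described (at least 3 informative words in the information need part after stop word removal) reduces to a short check. The following sketch assumes a generic regular-expression tokenizer and a small illustrative stop list; the paper does not specify the exact resources used.

import re

# Illustrative stop list; a stand-in for whatever list the authors used.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "i", "my",
              "in", "for", "it", "this", "that", "do", "you", "what", "how"}

def informative_words(text):
    # Lowercase, tokenize on word characters, and drop stop words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def keep_for_test(info_need_text, min_words=3):
    # The selection criterion: at least 3 informative words remain.
    return len(informative_words(info_need_text)) >= min_words

# 'need', 'cheap', 'hotel', 'near', 'airport', 'rome' survive the stop list.
print(keep_for_test("I need a cheap hotel near the airport in Rome"))  # True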

              Test c                              Test t
Methods       MRR    Precision@5  Precision@10   MRR    Precision@5  Precision@10
TFIDF         84.2%  67.1%        61.9%          92.8%  74.8%        63.3%
Knowledge1    82.2%  65.0%        65.6%          78.1%  67.0%        69.6%
Knowledge2    76.7%  54.9%        59.3%          61.6%  53.3%        58.2%
LDA1          92.5%  68.8%        64.7%          91.8%  75.4%        69.8%
LDA2          61.5%  55.3%        60.2%          52.1%  57.4%        54.5%

Table 2: Question recommendation results without information need prediction

              Test c                              Test t
Methods       MRR    Precision@5  Precision@10   MRR    Precision@5  Precision@10
TFIDF         86.2%  70.8%        64.3%          95.1%  77.8%        69.3%
Knowledge1    82.2%  65.0%        66.6%          76.7%  68.0%        68.7%
Knowledge2    76.7%  54.9%        60.2%          61.6%  53.3%        58.2%
LDA1          95.8%  72.4%        68.2%          96.2%  79.5%        69.2%
LDA2          61.5%  55.3%        58.9%          68.1%  58.3%        53.9%

Table 3: Question recommendation results with information need predicted by the translation model
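For reference, the MRR and Precision@k figures reported in Tables 2 and 3 can be computed as below. The sketch assumes each test question comes with a ranked recommendation list and a set of relevance judgments; the variable names are illustrative, not taken from the paper.

def mrr(ranked_lists, relevant_sets):
    # Mean reciprocal rank of the first relevant recommendation per query.
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranking, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def precision_at_k(ranked_lists, relevant_sets, k):
    # Fraction of the top-k recommendations that are relevant, averaged.
    scores = []
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        top_k = ranking[:k]
        scores.append(sum(1 for item in top_k if item in relevant) / k)
    return sum(scores) / len(scores)

rankings = [["q3", "q1", "q7"], ["q2", "q9", "q4"]]
relevant = [{"q1"}, {"q2", "q4"}]
print(mrr(rankings, relevant))                 # (1/2 + 1/1) / 2 = 0.75
print(precision_at_k(rankings, relevant, 3))   # (1/3 + 2/3) / 2 = 0.5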

5.1 Text Preprocessing

The questions posted on community QA sites often contain spelling or grammar errors. These errors influence the calculation of similarity and the performance of information retrieval (Zhao et al., 2007; Bunescu and Huang, 2010). In this paper, we use the open-source software afterthedeadline to automatically correct the spelling errors in the question and information need texts first. We also make use of the Web 1T 5-gram corpus to implement an N-gram based method (Cheng et al., 2008) that further filters out false positive corrections and re-ranks the correction suggestions (Mudge, 2010).
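The N-gram filtering step can be pictured as follows: each candidate correction is scored by the frequency of its surrounding n-gram in a large corpus, and suggestions that do not outscore the original spelling are discarded as likely false positives. This is only a sketch of the general idea, with a toy trigram table standing in for the Web 1T 5-gram data; it is not the exact method of Cheng et al. (2008) or Mudge (2010).

# Toy trigram counts; a real system would query a Web 1T 5-gram index.
TOY_COUNTS = {
    ("a", "hotel", "tonight"): 5000,
    ("a", "hostel", "tonight"): 800,
}

def ngram_count(ngram):
    # Stand-in for a corpus n-gram frequency lookup.
    return TOY_COUNTS.get(ngram, 0)

def rerank_suggestions(prev_word, word, next_word, suggestions):
    # Order candidates by trigram frequency in context, and drop any that
    # do not outscore the original word (likely false-positive corrections).
    baseline = ngram_count((prev_word, word, next_word))
    scored = sorted(
        ((ngram_count((prev_word, c, next_word)), c) for c in suggestions),
        reverse=True,
    )
    return [c for score, c in scored if score > baseline]

# For "book a hotol tonight", 'hotel' is ranked above 'hostel'.
print(rerank_suggestions("a", "hotol", "tonight", ["hostel", "hotel"]))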

The texts are tagged by Brill's Part-of-Speech Tagger, as the rule-based tagger is more robust than state-of-the-art statistical taggers on raw web content. This tagging information is only used for the WordNet similarity calculation.
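To illustrate how the POS tags enter the WordNet similarity calculation, the sketch below restricts the synset lookup by part of speech and takes the best path similarity between the two words. The use of NLTK, and of path similarity in particular, is an assumption here; the paper does not name its WordNet toolkit or similarity measure.

from nltk.corpus import wordnet as wn

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag prefix to a WordNet POS constant.
    if tag.startswith("N"):
        return wn.NOUN
    if tag.startswith("V"):
        return wn.VERB
    if tag.startswith("J"):
        return wn.ADJ
    if tag.startswith("R"):
        return wn.ADV
    return None

def word_similarity(w1, tag1, w2, tag2):
    # Best path similarity over the synsets allowed by each word's POS tag.
    synsets1 = wn.synsets(w1, pos=penn_to_wordnet(tag1))
    synsets2 = wn.synsets(w2, pos=penn_to_wordnet(tag2))
    scores = [a.path_similarity(b) for a in synsets1 for b in synsets2]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

# Restricting both words to noun senses; prints a value in (0, 1].
print(word_similarity("hotel", "NN", "hostel", "NN"))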

Stop word removal and lemmatization are applied to all the raw texts before they are fed into machine translation model training and LDA model estimation.
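A minimal version of this normalization step, assuming NLTK's English stop list and WordNet lemmatizer as stand-ins for the authors' actual resources:

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

_STOP = set(stopwords.words("english"))
_LEMMATIZER = WordNetLemmatizer()

def normalize(text):
    # Tokenize, drop stop words and punctuation, lemmatize what remains.
    tokens = word_tokenize(text.lower())
    return [_LEMMATIZER.lemmatize(t) for t in tokens
            if t.isalpha() and t not in _STOP]

print(normalize("Which hotels are the cheapest ones near Rome?"))
# ['hotel', 'cheapest', 'one', 'near', 'rome']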

Different alignment models such as IBM-1 to IBM-5 (Brown et al., 1993) and the HMM model (Och and Ney, 2000) provide different decompositions of