Different alignment models such as IBM-1 to IBM-5 (Brown et al., 1993) and HMM model (Och and Ney, 2000) provide different decompositions of

⁴ https://traloihay.net

             Test c                               Test t
Methods      MRR     Precision@5  Precision@10    MRR     Precision@5  Precision@10
TFIDF        84.2%   67.1%        61.9%           92.8%   74.8%        63.3%
Knowledge1   82.2%   65.0%        65.6%           78.1%   67.0%        69.6%
Knowledge2   76.7%   54.9%        59.3%           61.6%   53.3%        58.2%
LDA1         92.5%   68.8%        64.7%           91.8%   75.4%        69.8%
LDA2         61.5%   55.3%        60.2%           52.1%   57.4%        54.5%
Table 2: Question recommendation results without information need prediction

             Test c                               Test t
Methods      MRR     Precision@5  Precision@10    MRR     Precision@5  Precision@10
TFIDF        86.2%   70.8%        64.3%           95.1%   77.8%        69.3%
Knowledge1   82.2%   65.0%        66.6%           76.7%   68.0%        68.7%
Knowledge2   76.7%   54.9%        60.2%           61.6%   53.3%        58.2%
LDA1         95.8%   72.4%        68.2%           96.2%   79.5%        69.2%
LDA2         61.5%   55.3%        58.9%           68.1%   58.3%        53.9%
Table 3: Question recommendation results with information need predicted by translation model

5.1 Text Preprocessing

The questions posted on community QA sites often contain spelling or grammar errors. These errors influence the calculation of similarity and the performance of information retrieval (Zhao et al., 2007; Bunescu and Huang, 2010). In this paper, we first use the open source software afterthedeadline⁵ to automatically correct the spelling errors in the question and information need texts. We also use Web 1T 5-gram⁶ to implement an n-gram based method (Cheng et al., 2008) that further filters out false positive corrections and re-ranks the correction suggestions (Mudge, 2010). The texts are tagged by Brill’s Part-of-Speech Tagger⁷

(1 million), and ‘computers&internet’ (1 million). Depending on whether the best answers have been chosen by the asker, questions from Yahoo! Answers can be divided into ‘resolved’ and ‘unresolved’ categories. From each of the above two categories, we randomly selected 200 resolved questions to construct two testing data sets: ‘Test t’ (‘travel’), and ‘Test c’ (‘computers&internet’). In order to measure the information need similarity in our experiment, we selected only those questions whose information need part contained at least 3 informative