        1  stuffy nose    internet explorer
        2  cold           ie
        3  stuffy         internet browser
        4  sore throat    explorer
        5  sneeze         browser

Table 2: Phrase translation probability examples. Each column shows the top 5 target phrases learned from the word-aligned question-answer pairs.
useful for contextual lexical selection with sufficient training data, but can be subject to data sparsity issues (Sun et al., 2010; Gao et al., 2010). An alternate translation probability estimate not subject to data sparsity is the so-called lexical weight estimate (Koehn et al., 2003). Let P(w|t) be the word-to-word translation probability, and let A be the word alignment between w and t. Here, the word alignment contains (i, j) pairs, where i ∈ 1 . . . |w| and j ∈ 0 . . . |t|, with 0 indicating a null word. Then we use the following estimate:

    P_t(w|t, A) = ∏_{i=1}^{|w|} 1/|{j | (i, j) ∈ A}| ∑_{∀(i,j)∈A} P(w_i | t_j)    (18)
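Equation (18) is straightforward to compute once the alignment is fixed. The sketch below is illustrative only — the function name, the (word, word) probability table, the 0-based positions, and the use of None for the null word are assumptions of this example, not the paper's notation:

```python
from collections import defaultdict

def lexical_weight(w, t, alignment, p_word):
    """Lexical weight P_t(w|t, A) of equation (18): for each position i
    of the source phrase w, average P(w_i|t_j) over i's alignment links,
    then take the product over all positions.

    alignment: set of (i, j) pairs, 0-based; j is None for the null word.
    p_word:    dict mapping (w_i, t_j) to a word translation probability.
    Assumes every position of w has at least one link, as in the text."""
    links = defaultdict(list)
    for i, j in alignment:
        links[i].append(j)
    score = 1.0
    for i in range(len(w)):
        probs = [p_word.get((w[i], None if j is None else t[j]), 0.0)
                 for j in links[i]]
        score *= sum(probs) / len(probs)  # average over alignment links
    return score
```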
We assume that for each position in w, there is either a single alignment to 0, or multiple alignments to non-zero positions in t. In fact, equation (18) computes a product of per-word translation scores; the per-word scores are the averages of all the translations for the alignment links of that word. The word translation probabilities are calculated using IBM Model 1, which has been widely used for question retrieval (Jeon et al., 2005; Xue et al., 2008; Lee et al., 2008; Bernhard and Gurevych, 2009). These word-based scores of bi-phrases, though not as effective in contextual selection, are more robust to noise and sparsity.

A sample of the resulting phrase translation examples is shown in Table 2, where the top 5 target phrases are translated from the source phrases according to the phrase-based translation model. For example, the term "explorer" used alone most likely refers to a person who engages in scientific exploration, while the phrase "internet explorer" has a very different meaning.
3.4 Ranking Candidate Historical Questions

Unlike the word-based translation models, the phrase-based translation model cannot be interpolated with a unigram language model. Following (Sun et al., 2010; Gao et al., 2010), we resort to a linear ranking framework for question retrieval in which different models are incorporated as features. We consider learning a relevance function of the following general, linear form:

    Score(q, D) = θ^T · Φ(q, D)    (19)

where the feature vector Φ(q, D) is an arbitrary function that maps (q, D) to a real value, i.e., Φ(q, D) ∈ R. θ is the corresponding weight vector; we optimize these parameters directly for our evaluation metric using the Powell Search algorithm (Paul et al., 1992) via cross-validation.
The features used in this paper are as follows:

• Phrase translation features (PT): Φ_PT(q, D, A) = log P(q|D), where P(q|D) is computed using equations (12) to (15), and the phrase translation probability P(w|t) is estimated using equation (17).

• Inverted phrase translation features (IPT): Φ_IPT(D, q, A) = log P(D|q), where P(D|q) is computed using equations (12) to (15) except that we set µ_2 = 0 in equation (15), and the phrase translation probability P(w|t) is estimated using equation (17).

• Lexical weight features (LW): Φ_LW(q, D, A) = log P(q|D), where P(q|D) is computed using equations (12) to (15), and the phrase translation probability is computed as the lexical weight according to equation (18).

• Inverted lexical weight features (ILW): Φ_ILW(D, q, A) = log P(D|q), where P(D|q) is computed using equations (12) to (15) except that we set µ_2 = 0 in equation (15), and the phrase translation probability is computed as the lexical weight according to equation (18).

• Phrase alignment features (PA): Φ_PA(q, D, B) = ∑_{k=2}^{K} |a_k − b_{k−1} − 1|, where B is a set of K bi-phrases, a_k is the start position of the phrase in D that was translated into the kth phrase in the queried question, and b_{k−1} is the end position of the phrase in D that was translated into the (k−1)th phrase in the queried question. This feature, inspired by the distortion model in SMT (Koehn et al., 2003), models the degree to which the queried phrases are reordered. For all possible B, we compute the feature value only for the Viterbi alignment, B̂ = argmax_B P(q, B|D). We find B̂ using the Viterbi algorithm, which is almost identical to the dynamic programming recursion of equations (12) to (14), except that the sum operator in equation (13) is replaced with the max operator.

• Unaligned word penalty features (UWP): Φ_UWP(q, D), defined as the ratio between the number of unaligned words and the total number of words in the queried question.

• Language model features (LM): Φ_LM(q, D, A) = log P_LM(q|D), where P_LM(q|D) is the unigram language model with Jelinek-Mercer smoothing defined by equations (1) and (2).

• Word translation features (WT): Φ_WT
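Given the Viterbi bi-phrase segmentation, the PA and UWP features above are cheap to compute. A sketch under the assumption that each bi-phrase is represented by its (start, end) span in D, listed in the order of the queried question's phrases — the span representation and function names are illustrative, not the paper's:

```python
def pa_feature(spans):
    """Phi_PA = sum over k = 2..K of |a_k - b_{k-1} - 1| for K bi-phrase
    spans (a_k, b_k) in D. Adjacent, monotone phrases contribute 0; any
    jump or reordering is penalized by its distance."""
    return sum(abs(spans[k][0] - spans[k - 1][1] - 1)
               for k in range(1, len(spans)))

def uwp_feature(n_unaligned, n_total):
    """Phi_UWP: fraction of words in the queried question left unaligned."""
    return n_unaligned / n_total
```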
"CI TST". To obtain the ground-truth of question retrieval, we employ the Vector Space Model (VSM) (Salton et al., 1975) to retrieve the top 20 results and obtain manual judgements. The top 20 results do not include the queried question itself. Given a result returned by VSM, an annotator is asked to label it as "relevant" or "irrelevant": if a returned result is considered semantically equivalent to the queried question, the annotator labels it "relevant"; otherwise, "irrelevant". Two annotators are involved in the annotation process; if they conflict, a third person makes the final judgement. In the process of manually judging questions, the annotators are presented only the questions. Table 3 provides the statistics on the final test set.

            #queries    #returned    #relevant
    CI TST     300        6,000         798

Table 3: Statistics on the Test Data
We evaluate the performance of our approach using Mean Average Precision (MAP). We perform a significance test, i.e., a t-test, with a default significance level of 0.05. Following the literature, we set the parameter λ = 0.2 (Cao et al., 2010) in equa-
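MAP as used above averages, over queries, the precision at each rank where a relevant result appears. A short sketch, assuming the binary "relevant"/"irrelevant" judgements described for the test set:

```python
def average_precision(ranked_rels):
    """ranked_rels: booleans down the ranked list, True = relevant."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """MAP: mean of per-query average precision."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```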