
useful for contextual lexical selection with sufficient training data, but can be subject to data sparsity issues (Sun et al., 2010; Gao et al., 2010). An alternate translation probability estimate not subject to data sparsity is the so-called lexical weight estimate (Koehn et al., 2003). Let P(w | t) be the word-to-word translation probability, and let A be the word alignment between w and t. Here, the word alignment contains (i, j) pairs, where i ∈ 1 . . . |w| and j ∈ 0 . . . |t|, with 0 indicating a null word. Then we use the following estimate:

P_t(w | t, A) = ∏_{i=1}^{|w|} (1 / |{j | (i, j) ∈ A}|) ∑_{(i,j) ∈ A} P(w_i | t_j)    (18)

We assume that for each position in w, there is either a single alignment to 0, or multiple alignments to non-zero positions in t. In fact, equation (18) computes a product of per-word translation scores; the per-word scores are the averages of all the translations for the alignment links of that word. The word translation probabilities are calculated using IBM 1, which has been widely used for question retrieval (Jeon et al., 2005; Xue et al., 2008; Lee et al., 2008; Bernhard and Gurevych, 2009). These word-based scores of bi-phrases, though not as effective in contextual selection, are more robust to noise and sparsity.
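As a concrete illustration of equation (18), here is a minimal Python sketch of the lexical weight computation. It assumes the IBM 1 word translation probabilities are available as a nested dictionary and that the alignment A is a set of (i, j) pairs with j = 0 reserved for the null word; the function and variable names are illustrative, not taken from the original implementation.

def lexical_weight(w, t, alignment, p_word):
    """Lexical weight P_t(w | t, A) from equation (18).

    w, t      : source / target phrases as lists of words (1-indexed in the formula)
    alignment : set of (i, j) pairs; j == 0 denotes alignment to the null word
    p_word    : p_word[w_i][t_j] = IBM 1 word translation probability
    """
    score = 1.0
    for i, w_i in enumerate(w, start=1):
        # alignment links attached to source position i
        links = [j for (i2, j) in alignment if i2 == i]
        if not links:
            continue  # defensive: the paper assumes every position is aligned
        # average the word translation probabilities over the links;
        # "NULL" stands in for the null word at target position 0
        avg = sum(p_word.get(w_i, {}).get(t[j - 1] if j > 0 else "NULL", 0.0)
                  for j in links) / len(links)
        score *= avg
    return score

# Toy usage with the Table 2 phrases: w = ["stuffy", "nose"], t = ["cold"],
# both source words aligned to the single target word.
p = {"stuffy": {"cold": 0.3}, "nose": {"cold": 0.2}}
print(lexical_weight(["stuffy", "nose"], ["cold"], {(1, 1), (2, 1)}, p))  # 0.3 * 0.2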

A sample of the resulting phrase translations is shown in Table 2, where the top 5 target phrases are translated from the source phrases according to the phrase-based translation model. For example, the term "explorer" used alone most likely refers to a person who engages in scientific exploration, while the phrase "internet explorer" has a very different meaning.

1   stuffy nose    internet explorer
2   cold           ie
3   stuffy         internet browser
4   sore throat    explorer
5   sneeze         browser

Table 2: Phrase translation probability examples. Each column shows the top 5 target phrases learned from the word-aligned question-answer pairs.

3.4 Ranking Candidate Historical Questions

Unlike the word-based translation models, the phrase-based translation model cannot be interpolated with a unigram language model. Following (Sun et al., 2010; Gao et al., 2010), we resort to a linear ranking framework for question retrieval in which different models are incorporated as features. We consider learning a relevance function of the following general, linear form:

Score(q, D) = θ^T · Φ(q, D)    (19)

where the feature vector Φ(q, D) is an arbitrary function that maps (q, D) to a real value, i.e., Φ(q, D) ∈ R. θ is the corresponding weight vector; we optimize this parameter directly for our evaluation metric using the Powell Search algorithm (Paul et al., 1992) via cross-validation.
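To make equation (19) concrete, the sketch below scores a candidate question D with a weighted linear combination of features. The feature values and weights are placeholders for illustration only, and the Powell search that tunes θ is not shown.

# Minimal sketch of the linear ranking score in equation (19):
# Score(q, D) = theta^T . Phi(q, D), with the feature vector given as a dict.

def score(theta, phi):
    """Dot product of the weight vector theta and the feature vector Phi(q, D)."""
    return sum(theta[name] * value for name, value in phi.items())

# Hypothetical feature values for one (q, D) pair; in the paper these are the
# PT, IPT, LW, ILW, PA, UWP, LM and WT features listed below.
phi = {"PT": -4.2, "IPT": -5.1, "LW": -3.8, "ILW": -4.9,
       "PA": 2.0, "UWP": 0.25, "LM": -6.3, "WT": -5.0}
theta = {name: 1.0 for name in phi}   # weights, tuned e.g. by Powell search

print(score(theta, phi))  # candidates D are ranked by this value, highest first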

The features used in this paper are as follows:

Phrase translation features (PT): Φ_PT(q, D, A) = log P(q | D), where P(q | D) is computed using equations (12) to (15), and the phrase translation probability P(w | t) is estimated using equation (17).

Inverted phrase translation features (IPT): Φ_IPT(D, q, A) = log P(D | q), where P(D | q) is computed using equations (12) to (15), except that we set µ_2 = 0 in equation (15); the phrase translation probability P(w | t) is estimated using equation (17).

Lexical weight features (LW): Φ_LW(q, D, A) = log P(q | D), where P(q | D) is computed by equations (12) to (15), and the phrase translation probability is computed as the lexical weight according to equation (18).

Inverted lexical weight features (ILW): Φ_ILW(D, q, A) = log P(D | q), where P(D | q) is computed by equations (12) to (15), except that we set µ_2 = 0 in equation (15); the phrase translation probability is computed as the lexical weight according to equation (18).

Phrase alignment features (PA): Φ_PA(q, D, B) = ∑_{k=2}^{K} |a_k - b_{k-1} - 1|, where B is a set of K bi-phrases, a_k is the start position of the phrase in D that was translated into the kth phrase in the queried question, and b_{k-1} is the end position of the phrase in D that was translated into the (k-1)th phrase in the queried question. This feature, inspired by the distortion model in SMT (Koehn et al., 2003), models the degree to which the queried phrases are reordered; see the sketch after this list. Over all possible B, we compute the feature value only for the Viterbi alignment, B̂ = arg max_B P(q, B | D). We find B̂ using the Viterbi algorithm, which is almost identical to the dynamic programming recursion of equations (12) to (14), except that the sum operator in equation (13) is replaced with the max operator.

Unaligned word penalty features (UWP): Φ_UWP(q, D), defined as the ratio between the number of unaligned words and the total number of words in the queried question.
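As an illustration of the PA feature, the following sketch computes the distortion-style sum ∑_{k=2}^{K} |a_k - b_{k-1} - 1| from the start and end positions of the phrases in D under the Viterbi bi-phrase segmentation. The positions in the example are made-up toy values, and the Viterbi search itself (the max version of equations (12) to (14)) is not reproduced here.

def phrase_alignment_feature(spans):
    """PA feature: sum of |a_k - b_{k-1} - 1| over consecutive bi-phrases.

    spans : list of (start, end) positions in D, ordered by the index k of the
            query phrase they were translated into (the Viterbi segmentation).
    """
    total = 0
    for k in range(1, len(spans)):
        a_k = spans[k][0]         # start of the phrase translated into the kth query phrase
        b_prev = spans[k - 1][1]  # end of the phrase translated into the (k-1)th query phrase
        total += abs(a_k - b_prev - 1)
    return total

# Toy example: monotone, adjacent phrases contribute 0; a jump adds a penalty.
print(phrase_alignment_feature([(1, 2), (3, 5), (9, 10)]))  # |3-2-1| + |9-5-1| = 3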

Language model features (LM): Φ_LM(q, D, A) = log P_LM(q | D), where P_LM(q | D) is the unigram language model with Jelinek-Mercer smoothing defined by equations (1) and (2).

Word translation features (WT): Φ_WT(q, D) = log P(q | D), where P(q | D) is the word-based translation model defined by equations (3) and (4).

"CI TST". To obtain the ground truth for question retrieval, we employ the Vector Space Model (VSM) (Salton et al., 1975) to retrieve the top 20 results and obtain manual judgements. The top 20 results do not include the queried question itself. Given a result returned by VSM, an annotator is asked to label it as "relevant" or "irrelevant". If a returned result is considered semantically equivalent to the queried question, the annotator labels it as "relevant"; otherwise, the annotator labels it as "irrelevant". Two annotators are involved in the annotation process. If a conflict happens, a third person makes the final judgement. In the process of manually judging questions, the annotators are presented only the questions. Table 3 provides the statistics on the final test set.

          #queries   #returned   #relevant
CI TST    300        6,000       798

Table 3: Statistics on the Test Data

We evaluate the performance of our approach using Mean Average Precision (MAP). We perform a significance test, i.e., a t-test, with a default significance level of 0.05. Following the literature, we set the parameters λ = 0.2 (Cao et al., 2010) in equations (1), (3) and (5), and α = 0.8 (Xue et al., 2008) in equation (6).
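To make the evaluation metric concrete, below is a small Python sketch of Mean Average Precision over judged ranked lists such as the top-20 results described above. The relevance labels in the example are invented, not taken from CI TST, and average precision is normalized here by the number of relevant results retrieved, which is one common convention rather than a detail stated in the paper.

def average_precision(relevance):
    """Average precision for one query.

    relevance : list of 0/1 judgements for the returned results, in rank order
                (e.g. the top 20 results labelled "relevant"/"irrelevant").
    """
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Toy example with two queries and short ranked lists of judgements.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]]))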