2.3 Word-Based Translation Language Model

Xue et al. (2008) proposed to linearly mix two different estimations by combining the language model and the word-based translation model into a unified framework, called TransLM. The experiments show that this model gains better performance than both the language model and the word-based translation model. Following Xue et al. (2008), this model can be written as:

Score(q, D) = \prod_{w \in q} \left[ (1 - \lambda) P_{mx}(w|D) + \lambda P_{ml}(w|C) \right]   (5)

P_{mx}(w|D) = \alpha \sum_{t \in D} P(w|t) P_{ml}(t|D) + (1 - \alpha) P_{ml}(w|D)   (6)

[4] In this paper, a document has the same meaning as a historical question-answer pair in the Q&A archives.
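To make equations (5) and (6) concrete, here is a minimal sketch of the TransLM score in Python. The probability tables (p_w_given_t for P(w|t), p_ml_doc for P_ml(.|D), p_ml_coll for P_ml(.|C)) and the interpolation weights lam and alpha are hypothetical placeholders; the paper does not specify an implementation.

import math

def translm_score(q_words, d_words, p_w_given_t, p_ml_doc, p_ml_coll,
                  lam=0.2, alpha=0.8):
    # Log of equation (5): product over query words w of the smoothed mixture.
    log_score = 0.0
    for w in q_words:
        # Equation (6): word-based translation mixed with the document LM.
        p_trans = sum(p_w_given_t.get((w, t), 0.0) * p_ml_doc.get(t, 0.0)
                      for t in d_words)
        p_mx = alpha * p_trans + (1.0 - alpha) * p_ml_doc.get(w, 0.0)
        # Equation (5): smooth with the collection language model P_ml(w|C).
        log_score += math.log((1.0 - lam) * p_mx + lam * p_ml_coll.get(w, 1e-12))
    return log_score

Ranking candidate documents D by translm_score reproduces the scoring of equation (5) up to the monotone log transform.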

Finally, these phrases are permuted and concatenated to form the queried question q, where t and w denote phrases, i.e., consecutive sequences of words.

To formulate this generative process, let E denote the segmentation of D into K phrases t_1, ..., t_K, and let F denote the K translation phrases w_1, ..., w_K; we refer to these (t_i, w_i) pairs as bi-phrases. Finally, let M denote a permutation of K elements representing the final reordering step. Figure 1 describes an example of the generative procedure.
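As a complement to Figure 1, a toy derivation of one (E, F, M) triple might look as follows in Python; all phrases here are invented for illustration, not taken from the paper.

# Toy illustration of one (E, F, M) derivation; all strings are invented.
D = "how do i install python on windows".split()
E = [(0, 2), (2, 4), (4, 7)]          # segmentation of D into K = 3 phrases
T = [D[s:e] for s, e in E]            # t_1 .. t_K, e.g. t_3 = ['python', 'on', 'windows']
F = [["how", "to"], ["setup"], ["python", "for", "windows"]]  # translations w_1 .. w_K
M = [0, 1, 2]                         # permutation of K elements (monotone here)
q = [w for k in M for w in F[k]]      # permute and concatenate the w_k
print(" ".join(q))                    # -> "how to setup python for windows"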

Next let us place a probability distribution over rewrite pairs. Let B(D, q) denote the set of (E, F, M) triples that translate D into q. Here we assume a uniform probability over segmentations, so the phrase-based translation model can be formulated as:

P(q|D) \propto \sum_{(E,F,M) \in B(D,q)} P(F|D, E) \cdot P(M|D, E, F)   (7)

As is common practice in SMT, we use the maximum approximation to the sum:

P(q|D) \approx \max_{(E,F,M) \in B(D,q)} P(F|D, E) \cdot P(M|D, E, F)   (8)

Although we have defined a generative model for translating D into q, our goal is to calculate the ranking score function over existing q and D, rather than generating new queried questions. Equation (8) cannot be used directly for document ranking because q and D are often of very different lengths, leaving many words in D unaligned to any word in q. This is the key difference between community-based question retrieval and general natural language translation. As pointed out by Berger and Lafferty (1999) and Gao et al. (2010), document-query translation requires a distillation of the document, while translation of natural language tolerates little being thrown away.

Thus we attempt to extract the key document words that form the distillation of the document, and assume that a queried question is translated only from the key document words. In this paper, the key document words are identified via word alignment. We introduce the "hidden alignments" A = a_1 ... a_j ... a_J, which describe the mapping from a word position j in the queried question to a document word position i = a_j. The different alignment models we present provide different decompositions of P(q, A|D). We assume that the positions of the key document words are determined by the Viterbi alignment:

\hat{A} = \arg\max_{A} P(q, A|D)   (9)
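Purely as an illustration, under an IBM Model 1 style decomposition (an assumption here, not necessarily the paper's alignment model), the Viterbi alignment of equation (9) factorizes so that each query word independently selects its best document word:

# Hedged sketch: IBM Model 1 style Viterbi alignment. p_t[(w, d)] ~ P(w|d)
# is a hypothetical word-translation table.
def viterbi_alignment(q_words, d_words, p_t):
    a = []
    for w in q_words:
        # arg max over document positions i of P(w | d_i)
        i_best = max(range(len(d_words)),
                     key=lambda i: p_t.get((w, d_words[i]), 0.0))
        a.append(i_best)
    return a  # a_1 .. a_J: the aligned (key) document word positions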

We then keep only the (E, F, M) triples consistent with \hat{A}, which we denote as \hat{B}(D, q, \hat{A}). Here, consistency requires that if two words are aligned in \hat{A}, then they must appear in the same bi-phrase (t_i, w_i). Once the word alignment is fixed, the final permutation is uniquely determined, so we can safely discard that factor. Thus equation (8) can be rewritten as:

P(q|D) \approx \max_{(E,F,M) \in \hat{B}(D,q,\hat{A})} P(F|D, E)   (10)
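To make the consistency requirement concrete, a small check might look like this; the helper names are invented, spans are half-open (start, end) intervals over word positions, and the k-th document span pairs with the k-th question span as a bi-phrase:

def consistent(q_spans, d_spans, alignment):
    # True iff every aligned pair (j, a_j) falls inside the same bi-phrase,
    # i.e. the question span containing j and the document span containing
    # a_j carry the same index k.
    def span_of(spans, pos):
        for k, (s, e) in enumerate(spans):
            if s <= pos < e:
                return k
        return None
    return all(span_of(q_spans, j) == span_of(d_spans, i)
               for j, i in alignment)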

For the sole remaining factor P(F|D, E), we make the assumption that a segmented queried question F = w_1, ..., w_K is generated from left to right by translating each phrase t_1, ..., t_K independently:

P(F|D, E) = \prod_{k=1}^{K} P(w_k|t_k)   (11)

where P(w_k|t_k) is a phrase translation probability, whose estimation will be described in Section 3.3.

To find the maximum probability assignment efficiently, we use a dynamic programming approach, somewhat similar to the monotone decoding algorithm described in (Och, 2002). We define \alpha_j to be the probability of the most likely sequence of phrases covering the first j words in a queried question; then the probability can be calculated using the following recursion:

(1) Initialization:

\alpha_0 = 1   (12)

(2) Induction:

\alpha_j = \max_{j' < j,\; w = w_{j'+1} \ldots w_j} \left\{ \alpha_{j'} \cdot P(w|t_w) \right\}   (13)

(3) Total:

P(q|D) = \alpha_J   (14)
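A minimal Python sketch of this recursion follows, assuming a hypothetical table p_phrase that maps a target phrase w to P(w|t_w), its translation probability given its best source phrase in D; the paper's own implementation is not shown, so names and the phrase-length cap are placeholders.

def phrase_dp_score(q_words, p_phrase, max_phrase_len=4):
    # Equations (12)-(14): monotone dynamic program over question prefixes.
    J = len(q_words)
    alpha = [0.0] * (J + 1)
    alpha[0] = 1.0                                  # (12) alpha_0 = 1
    for j in range(1, J + 1):                       # (13) induction
        for jp in range(max(0, j - max_phrase_len), j):
            w = " ".join(q_words[jp:j])             # w = w_{j'+1} ... w_j
            alpha[j] = max(alpha[j], alpha[jp] * p_phrase.get(w, 0.0))
    return alpha[J]                                 # (14) P(q|D) = alpha_J

Restricting j' to at most max_phrase_len positions back keeps each induction step constant-time; without the cap the recursion is quadratic in J.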