SECTION 1 INTRODUCES THE QUERY SNOWBALL (QSB)METHOD WHICH COMPUTES T...

3.2 Score Maximization Using Word Pairs

Figure 2: Co-occurrence Graph (Query Snowball)

Having determined the query relevance score, the

We represent this base word score by s

b

(w) =

next step is to define the summary score. To this end,

log(N/ctf (w)) or s

b

(w) = log(N/n(w)), where

we use word pairs rather than individual words as the

ctf (w) is the total number of occurrences of w

basic unit. This is because word pairs are more in-

within the corpus and n(w) is the document fre-

formative for discriminating across different pieces

of information than single common words. (Re-

quency of w, and N is the total number of docu-

call the example mentioned in Section 1) Thus, the

ments in the corpus. We will refer to these two ver-

sions as itf and idf, respectively. Our second clue

word pair score is simply defined as: s

p

(w

1

, w

2

) =

is the weight propagated from the center of the co-

s

r

(w

1

)s

r

(w

2

) and the summary score is computed

occurence graph shown in Figure 1. Below, we de-

as:

f

QSBP

(S) = ∑

s

p

(w

1

, w

2

) (3)

scribe how to compute the word scores for words in

{w

1

,w

2

|w

1

6=w

2

and

w

1

,w

2

∈u

and

u∈S}

R1 and then those for words in R2.

As Figure 2 suggests, the query relevance score

where u is a textual unit, which in our case is a

for r1 R1 is computed based not only on its base

sentence. Our problem then is to select S to maxi-

word score but also on the relationship between r1

mize f

QSBP

(S). The above function based on word

and q Q. To be more specific, let f req(w, w

0

)

pairs is still submodular, and therefore we can apply

denote the within-sentence co-occurrence frequency

a greedy approximate algorithm with performance

for words w and w

0

, and let distance (w, w

0

) denote

guarantee as proposed in previous work (Khuller

the minimum dependency distance between w and

et al., 1999; Takamura and Okumura, 2009a). Let

w

0

: A dependency distance is the path length be-

l(u) denote the length of u. Given a set of source

tween nodes w and w

0

within a dependency parse

documents D and a length limit L for a sum-

tree; the minimum dependency distance is the short-

mary,

est path length among all dependency parse trees of

Require: D, L

source-document sentences in which w and w

0

co-