SECTION 1 INTRODUCES THE QUERY SNOWBALL (QSB)METHOD WHICH COMPUTES T...
3.2 Score Maximization Using Word Pairs
Figure 2: Co-occurrence Graph (Query Snowball)Having determined the query relevance score, the
We represent this base word score by s
b
(w) =
next step is to define the summary score. To this end,
log(N/ctf (w)) or s
b
(w) = log(N/n(w)), where
we use word pairs rather than individual words as the
ctf (w) is the total number of occurrences of w
basic unit. This is because word pairs are more in-
within the corpus and n(w) is the document fre-
formative for discriminating across different pieces
of information than single common words. (Re-
quency of w, and N is the total number of docu-
call the example mentioned in Section 1) Thus, the
ments in the corpus. We will refer to these two ver-
sions as itf and idf, respectively. Our second clue
word pair score is simply defined as: s
p
(w
1
, w
2
) =
is the weight propagated from the center of the co-
s
r
(w
1
)s
r
(w
2
) and the summary score is computed
occurence graph shown in Figure 1. Below, we de-
as:
f
QSBP
(S) = ∑
s
p
(w
1
, w
2
) (3)
scribe how to compute the word scores for words in
{w
1
,w
2
|w
1
6=w
2
and
w
1
,w
2
∈u
and
u∈S}
R1 and then those for words in R2.
As Figure 2 suggests, the query relevance score
where u is a textual unit, which in our case is a
for r1 ∈ R1 is computed based not only on its base
sentence. Our problem then is to select S to maxi-
word score but also on the relationship between r1
mize f
QSBP
(S). The above function based on word
and q ∈ Q. To be more specific, let f req(w, w
0
)
pairs is still submodular, and therefore we can apply
denote the within-sentence co-occurrence frequency
a greedy approximate algorithm with performance
for words w and w
0
, and let distance (w, w
0
) denote
guarantee as proposed in previous work (Khuller
the minimum dependency distance between w and
et al., 1999; Takamura and Okumura, 2009a). Let
w
0
: A dependency distance is the path length be-
l(u) denote the length of u. Given a set of source
tween nodes w and w
0
within a dependency parse
documents D and a length limit L for a sum-
tree; the minimum dependency distance is the short-
mary,
est path length among all dependency parse trees of
Require: D, L
source-document sentences in which w and w
0