8: Obtain summary dataset $\bar{X} = \{\bar{X}^s\}_{s=1}^{q} = \{\bar{X}_i\}_{i=1}^{p}$ and local density constraints, $\delta = \{\delta_i\}_{i=1}^{p}$.

is the label of the selected boundary $B_i^s$. We identify the edge weights $w_{ij}^s$ between each node in the boundary $B_i^s$ via (1), thus the boundary is connected. We calculate the weighted average of the vertices to obtain the representative summary node of $B_i^s$ as shown in Figure 1-(c);

$$\bar{X}^s_{B_i} = \frac{\sum_{i \neq j = 1}^{n_m} \frac{1}{2} w_{ij}^s \left(x_i^s + x_j^s\right)}{\sum_{i \neq j = 1}^{n_m} w_{ij}^s} \qquad (8)$$

The boundaries of some nodes may only contain themselves because their immediate neighbors may have opposite class labels. Similarly

After all data points are evaluated, the sample dataset $X^s$ can now be represented with the summary representative vertices as

$$\bar{X}^s = \left\{\bar{X}^s_{B_1}, \ldots, \bar{X}^s_{B_{n_b}}\right\} \qquad (9)$$

and corresponding local density constraints as,

$$\delta^s = \{\delta^s_1, \ldots, \delta^s_{n_b}\}^T, \quad 0 < \delta^s_i \leq 1 \qquad (10)$$
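As a rough illustration of Equation (8), the weighted average over one boundary could be sketched as follows. The function name `summary_node` and the arguments `vertices` and `weights` are hypothetical and not taken from the paper. The fallback to a plain mean when all edge weights are zero (for instance, a boundary that contains only one node) is an assumption made for this sketch, since the text does not say how that case is handled.

```python
def summary_node(vertices, weights):
    """Weighted average of the vertices in one boundary, following Eq. (8).

    vertices: list of feature vectors x_i (lists of floats), one per node.
    weights:  square list of lists, weights[i][j] = edge weight w_ij.
    """
    n = len(vertices)
    dim = len(vertices[0])
    numerator = [0.0] * dim
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the sums in Eq. (8) run over i != j
            w = weights[i][j]
            denominator += w
            for d in range(dim):
                numerator[d] += 0.5 * w * (vertices[i][d] + vertices[j][d])
    if denominator == 0.0:
        # Assumption (not from the paper): with no weighted edges,
        # summarize the boundary by the plain mean of its vertices.
        return [sum(v[d] for v in vertices) / n for d in range(dim)]
    return [value / denominator for value in numerator]


# Toy boundary: node 0 is connected to nodes 1 and 2 with unit weights.
vertices = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
weights = [[0.0, 1.0, 1.0],
           [1.0, 0.0, 0.0],
           [1.0, 0.0, 0.0]]
summary = summary_node(vertices, weights)
```

In this toy boundary the better connected node 0 pulls the summary toward itself, so `summary` lies closer to the origin than the unweighted centroid of the three vertices.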

If any testing vector has an edge to a labeled vector, then with the use of the local density constraints, the edge weights will be affected not only by that labeled node, but also by how dense that node is within that part of the graph.

The summarization algorithm is repeated for each random subset $X^s$, $s = 1, \ldots, q$, of the very large dataset $X = X_L \cup X_U$; see Algorithm 1. As a result, $q$ summary datasets $\bar{X}^s$, each with $n_b$ labeled data points, are combined to form a representative sample of $X$, $\bar{X} = \{\bar{X}^s\}_{s=1}^{q}$, reducing the number of data points from $n$ to a much smaller number, $p = q \ast n_b \ll n$. So the new summary of $X$ can be represented with $\bar{X} = \{\bar{X}_i\}_{i=1}^{p}$. For example, an original dataset with 1M data points can be divided into up to $q = 50$ random samples of $m = 5000$ data points each. Then, using graph summarization, each summarized dataset may be represented with $n_b \cong 500$ data points. After merging the summarized data, the final summarized samples compile to $500 \ast 50 \cong 25\text{K} \ll 1\text{M}$ data points, reduced to

5 Experiments

We demonstrate the results from three sets of experiments to explore how our graph representation, which encodes textual entailment information, can be used to improve the performance of the QA systems. We show that as we increase the number of unlabeled data, with our graph summarization, it is feasible to extract information that can improve the performance of QA models. We performed experiments on a set of 1449

questions from TREC-99-03. Using the search en-