OBTAIN SUMMARY DATASET X = I IS THE LABEL OF THE SELECTED BOUNDARY B I...

Question

8: Obtain summary dataset $\bar{X} = \{\bar{X}^s\}_{s=1}^{q} = \{\bar{X}_i\}_{i=1}^{p}$ and local density constraints, $\delta = \{\delta_i\}_{i=1}^{p}$.

[...] is the label of the selected boundary $B_i^s$. We identify the edge weights $w_{ij}^s$ between each node in the boundary $B_i^s$ via (1), thus the boundary is connected. We calculate the weighted average of the vertices to obtain the representative summary node of $B_i^s$ as shown in Figure 1-(c);

$$\bar{X}^s_{B_i} = \frac{\sum_{i \neq j = 1}^{n_m} \frac{1}{2} w_{ij}^s \left(x_i^s + x_j^s\right)}{\sum_{i \neq j = 1}^{n_m} w_{ij}^s} \tag{8}$$

The boundaries of some nodes may only contain themselves because their immediate neighbors may have opposite class labels. Similarly [...]

The summarization algorithm is repeated for each random subset $X^s$, $s = 1, ..., q$, of the very large dataset $X = X_L \cup X_U$; see Algorithm 1. As a result, the $q$ summary datasets $\bar{X}^s$, each with $n_b$ labeled data points, are combined to form a representative sample of $X$, $\bar{X} = \{\bar{X}^s\}_{s=1}^{q}$, reducing the number of data points from $n$ to a much smaller number, $p = q * n_b \ll n$. So the new summary of $X$ can be represented with $\bar{X} = \{\bar{X}_i\}_{i=1}^{p}$. For example, an original dataset with 1M data points can be divided into up to $q = 50$ random samples of $m = 5000$ data points each. Then, using graph summarization, each summarized dataset may be represented with $n_b \cong 500$ data points. After merging the summarized data, the final summarized samples compile to $500 * 50 \cong 25\text{K} \ll 1\text{M}$ data points, reduced to [...]

After all data points are evaluated, the sample dataset $X^s$ can now be represented with the summary representative vertices as

$$\bar{X}^s = \left\{\bar{X}^s_{B_1}, ..., \bar{X}^s_{B_{n_b}}\right\} \tag{9}$$

and the corresponding local density constraints as

$$\delta^s = \left\{\delta^s_1, ..., \delta^s_{n_b}\right\}^T, \quad 0 < \delta^s_i \le 1 \tag{10}$$

If any testing vector has an edge to a labeled vector, then with the use of the local density constraints, the edge weights will be affected not only by that labeled node, but also by how dense that node is within that part of the graph.

5 Experiments

We demonstrate the results from three sets of experiments to explore how our graph representation, which encodes textual entailment information, can be used to improve the performance of the QA systems. We show that as we increase the number of unlabeled data, with our graph-summarization, it is feasible to extract information that can improve the performance of QA models. We performed experiments on a set of 1449 questions from TREC-99-03. Using the search en-
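As an illustration of the summary-node computation in (8) and of the size reduction $p = q * n_b$ described in the summarization section, here is a minimal Python sketch. The function names `summary_node` and `summarized_size`, the dictionary layout of the edge weights, and the fallback for a boundary that contains only its own node are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def summary_node(points: List[Vector],
                 weights: Dict[Tuple[int, int], float]) -> List[float]:
    """Weighted average of pairwise midpoints, following the form of Eq. (8).

    points  : feature vectors x_i of the nodes in one boundary
    weights : edge weights w_ij keyed by (i, j) with i != j (hypothetical layout)
    """
    dim = len(points[0])
    numerator = [0.0] * dim
    denominator = 0.0
    for (i, j), w in weights.items():
        if i == j:
            continue  # Eq. (8) only sums over pairs with i != j
        for d in range(dim):
            numerator[d] += 0.5 * w * (points[i][d] + points[j][d])
        denominator += w
    if denominator == 0.0:
        # Assumption: a boundary with no weighted edges (for example one that
        # contains only its own node) is represented by its first node.
        return list(points[0])
    return [value / denominator for value in numerator]


def summarized_size(q: int, n_b: int) -> int:
    """Number of summary points p = q * n_b after merging q summaries."""
    return q * n_b


if __name__ == "__main__":
    boundary = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
    edge_weights = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 2.0}
    print(summary_node(boundary, edge_weights))
    print(summarized_size(q=50, n_b=500))
```

With symmetric weights, summing over ordered or unordered pairs gives the same ratio, so the sketch stores each edge once. The second call reproduces the 50 * 500 = 25K figure from the worked example in the text.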


Source document: scientific report "A Graph-based Semi-Supervised Learning for Question-Answering"
