8: Obtain summary dataset $\bar{X} = \{\bar{X}^s\}_{s=1}^{q} = \{\bar{X}_i\}_{i=1}^{p}$ and local density constraints, $\delta = \{\delta_i\}_{i=1}^{p}$.

i is the label of the selected boundary $B_i^s$. We identify the edge weights $w_{ij}^s$ between each node in the boundary $B_i^s$ via (1), so that the boundary is connected. We calculate the weighted average of the vertices to obtain the representative summary node of $B_i^s$, as shown in Figure 1-(c):

$$\bar{X}^s_{B_i} = \frac{\sum_{i \neq j = 1}^{n_m} \frac{1}{2}\, w_{ij}^s \left(x_i^s + x_j^s\right)}{\sum_{i \neq j = 1}^{n_m} w_{ij}^s} \qquad (8)$$

The boundaries of some nodes may only contain themselves because their immediate neighbors may have opposite class labels. Similarly

After all data points are evaluated, the sample dataset $X^s$ can now be represented with the summary representative vertices as

$$\bar{X}^s = \left\{\bar{X}^s_{B_1}, \ldots, \bar{X}^s_{B_{n_b}}\right\} \qquad (9)$$

and corresponding local density constraints as

$$\delta^s = \{\delta_1^s, \ldots, \delta_{n_b}^s\}^T, \quad 0 < \delta_i^s \leq 1 \qquad (10)$$
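As an illustration, the weighted average in (8) can be sketched in Python as follows. The function name `summary_node`, the dense symmetric weight matrix indexed as `weights[i][j]`, and the fallback used when a boundary has no weighted edges are assumptions made for this sketch; they are not part of the method description above.

```python
from typing import List, Sequence


def summary_node(points: Sequence[Sequence[float]],
                 weights: Sequence[Sequence[float]]) -> List[float]:
    """Weighted average of the vertices in one boundary, following (8)."""
    n = len(points)
    dim = len(points[0])
    numerator = [0.0] * dim
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the sums in (8) run over pairs with i != j
            w = weights[i][j]
            for d in range(dim):
                numerator[d] += 0.5 * w * (points[i][d] + points[j][d])
            denominator += w
    if denominator == 0.0:
        # Assumption: a boundary without weighted edges (for example a
        # boundary that contains only the node itself) is summarized by
        # the plain mean of its points.
        return [sum(p[d] for p in points) / n for d in range(dim)]
    return [value / denominator for value in numerator]


# Three boundary nodes in 2-D with symmetric edge weights.
points = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
weights = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 2.0],
    [1.0, 2.0, 0.0],
]
print(summary_node(points, weights))  # → [0.75, 0.75]
```

In this example the heavier edge between the second and third nodes pulls the summary node toward their midpoint and away from the first node.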
If any testing vector has an edge to a labeled vector, then with the use of the local density constraints, the edge weights will be affected not only by that labeled node, but also by how dense that node is within that part of the graph.

The summarization algorithm is repeated for each random subset $X^s$, $s = 1, \ldots, q$, of the very large dataset $X = X_L \cup X_U$; see Algorithm 1. As a result, the $q$ summary datasets $\bar{X}^s$, each with $n_b$ labeled data points, are combined to form a representative sample of $X$, $\bar{X} = \{\bar{X}^s\}_{s=1}^{q}$, reducing the number of data points from $n$ to a much smaller number, $p = q * n_b \ll n$. So the new summary of $X$ can be represented with $\bar{X} = \{\bar{X}_i\}_{i=1}^{p}$. For example, an original dataset with 1M data points can be divided into up to $q = 50$ random samples of $m = 5000$ data points each. Then, using graph summarization, each summarized dataset may be represented with $n_b \approx 500$ data points. After merging the summarized data, the final summarized samples compile to $500 * 50 \approx 25\text{K} \ll 1\text{M}$ data points, reduced to

5 Experiments

We demonstrate the results from three sets of experiments to explore how our graph representation, which encodes textual entailment information, can be used to improve the performance of the QA systems. We show that as we increase the number of unlabeled data, with our graph-summarization, it is feasible to extract information that can improve the performance of QA models.

We performed experiments on a set of 1449 questions from TREC-99-03. Using the search en-