4.2 Prediction of New Testing Dataset

Instead of using the large dataset, we now use the summary dataset with predicted labels, together with the local density constraints, to learn the class labels of n_te unseen data points, i.e., the testing data points X_Te = {x_1, ..., x_{n_te}}. We apply the graph-based SSL method to the new representative dataset X' = X ∪ X_Te, which is comprised of the summarized dataset X = {x_i}_{i=1}^p as labeled data points and the testing dataset X_Te as unlabeled data points. Since we do not know the estimated local density constraints of the unlabeled data points, we use constants to construct the local density constraint column vector for the X' dataset as follows:

$$\delta' = \{1+\delta_i\}_{i=1}^{p} \cup [1 \dots 1]^T \in \mathbb{R}^{n_{te}}, \qquad (11)$$

where $0 < \delta_i \leq 1$. To embed the local density constraints, the second term in (3) is replaced with the constrained normalized Laplacian, $L_c = \delta^T L \delta$:

$$\sum_{i,j \in L \cup T} w_{ij} \left( \frac{f_i}{\sqrt{\delta'_i \, d_i}} - \frac{f_j}{\sqrt{\delta'_j \, d_j}} \right)^2 = f^T L_c f. \qquad (12)$$

...ated labeled training dataset X_L of around 7600 q/a pairs. We evaluated the performance of the graph-based QA system using a set of 202 questions from TREC04 as the testing dataset (Voorhees, 2003; Prager et al., 2000). We retrieved around 20 candidate sentences for each of the 202 test questions and manually labeled each q/a pair as true/false entailment to compile 4037 test data points.

To obtain more unlabeled training data X_U, we extracted around 100,000 document headlines from a large newswire corpus. Instead of matching each headline with the first sentence of its document, as in (Harabagiu and Hickl, 2006), we followed a different approach. Using each headline as a query, we retrieved around 20 top-ranked sentences from a search engine. For each headline, we picked the 1st and the 20th retrieved sentences. Our assumption is that the first retrieved sentence has a higher probability of entailing the headline, whereas the last one has a lower probability. Each of these headline-candidate sentence pairs is used as an additional unlabeled q/a pair. Since each headline represents a converted question, in order to extract the question-type feature we use a matching NER type between the headline and the candidate sentence to set the question-type NER match feature.

We applied the pre-processing and feature extraction steps of section 2 to compile the labeled and unlabeled training datasets and the labeled testing dataset. We use the rank scores obtained from the search engine as the baseline of our system. We present the performance of the models using Mean Reciprocal Rank (MRR) and top 1 (Top1) and top 5 (Top5) prediction accuracies, as they are the most commonly used performance measures for QA systems (Voorhees, 2004). We performed manual iterative parameter optimization during training based on prediction accuracy to find the best k-nearest parameter for SSL, i.e., k = {3, 5, 10, 20, 50}, and the best C = {10^-2, ..., 10^2} and γ = {2^-2, ..., 2^3} for the RBF kernel SVM. Next we describe three different experiments and present individual results. Graph summarization makes it feasible to execute SSL on very large unlabeled datasets, which was otherwise impossible. This paper makes no assumptions on the performance of the method in comparison to other SSL methods.

Experiment 1. Here we test the individual contribution of each set of features to our QA system. We applied SVM and our graph-based SSL method with no summarization to learn models using the labeled training and testing datasets. For SSL we used the training set as labeled data and the testing set as unlabeled data, in a transductive way, to predict the entailment scores. The results are shown in Table 1. From section 2.2, QTCF represents the question-type NER match feature, LexSem is the bundle of lexico-semantic features, and QComp comprises the matching features of subject, head, object, and three complements. In comparison to the baseline, QComp has a significant effect on the accuracy of the QA system. In addition, QTCF has been shown to improve the MRR performance by about 22%. Although the LexSem features have minimal semantic properties, they can improve MRR performance by 14%.

Features  Model  MRR    Top1   Top5
Baseline  −      42.3%  32.7%  54.5%
QTCF      SVM    51.9%  44.6%  63.4%
          SSL    49.5%  43.1%  60.9%
LexSem    SVM    48.2%  40.6%  61.4%
          SSL    47.9%  40.1%  58.4%
QComp     SVM    54.2%  47.5%  64.3%
          SSL    51.9%  45.5%  62.4%

Table 1: MRR for different features and methods.

Experiment 2. To evaluate the performance of graph summarization we performed two separate experiments. In the first part, we randomly selected subsets of the labeled training dataset X_L^i ⊂ X_L with different sample sizes, n_L^i = {1%·n_L, 5%·n_L, 10%·n_L, 25%·n_L, 50%·n_L, 100%·n_L}, where n_L represents the sample size of X_L. At each random selection, the rest of the labeled dataset is hypothetically used as unlabeled data to verify the performance of our SSL with different sizes of labeled data. Table 2 reports the MRR performance of the QA system on the testing dataset using SVM and our graph-summary SSL (gSum SSL) method with the similarity function in (1). In the second part of the experiment, we applied graph summarization to copula and non-copula questions separately and merged the obtained representative points to create the labeled summary dataset. Then, using the similarity function in (2), we applied SSL on the labeled summary and unlabeled testing data via transduction. We call these models Hybrid gSum SSL. To build SVM models in the same way, we separated the training dataset into two parts based on copula and non-copula questions, X_cp and X_ncp, and re-ran the SVM method separately. The testing dataset is divided into two parts accordingly. The models trained on copula sentence datasets are applied to the copula sentences of the testing dataset, and vice versa for non-copula sentences. The predicted scores are combined to measure the overall performance of the Hybrid SVM models. We repeated the experiments five times with different random samples and averaged the results.

% Labeled  SVM               gSum SSL          Hybrid SVM        Hybrid gSum SSL
           MRR   Top1  Top5  MRR   Top1  Top5  MRR   Top1  Top5  MRR   Top1  Top5
1%         45.2  33.2  65.8  56.1  44.6  72.8  51.6  40.1  70.8  59.7  47.0  75.2
5%         56.5  45.1  73.0  57.3  46.0  73.7  54.2  40.6  72.3  60.3  48.5  76.7
10%        59.3  47.5  76.7  57.9  46.5  74.2  57.7  47.0  74.2  60.4  48.5  77.2
25%        59.8  49.0  78.7  58.4  45.0  79.2  61.4  49.5  78.2  60.6  49.0  76.7
50%        60.9  48.0  80.7  58.9  45.5  79.2  62.2  51.0  79.7  61.3  50.0  77.2
100%       63.5  55.4  77.7  59.7  47.5  79.7  67.6  58.0  82.2  61.9  51.5  78.2

Table 2: The MRR (%) results of graph-summary SSL (gSum SSL) and SVM, as well as Hybrid gSum SSL and Hybrid SVM, with different sizes of labeled data.

Note from Table 2 that when the number of labeled data points is small (n_L^i < 10%·n_L), graph-based SSL (gSum SSL) performs better than SVM. As the percentage of labeled points in the training data increases, the SVM performance increases; however, graph-summary SSL remains comparable with SVM. On the other hand, when we build separate models for copula and non-copula questions with different features, the performance of the overall model increases significantly for both methods. Especially for Hybrid graph-summary SSL (Hybrid gSum SSL), when the number of labeled data points is small (n_L^i < 25%·n_L) the performance improvement is better than for the rest
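The ranking measures reported in Tables 1 and 2 (MRR, Top1, Top5) can be computed from per-question ranks as in the short sketch below; the function name and data layout are illustrative assumptions, not taken from the paper's implementation.

```python
def qa_metrics(first_correct_ranks):
    """Compute MRR, Top1 and Top5 accuracy.

    first_correct_ranks holds, for each question, the 1-based rank of
    the first correct candidate sentence, or None if no retrieved
    candidate is a correct answer."""
    n = len(first_correct_ranks)
    mrr = sum(1.0 / r for r in first_correct_ranks if r is not None) / n
    top1 = sum(1 for r in first_correct_ranks if r is not None and r <= 1) / n
    top5 = sum(1 for r in first_correct_ranks if r is not None and r <= 5) / n
    return mrr, top1, top5

# Toy example: ranks of the first correct candidate for four questions
mrr, top1, top5 = qa_metrics([1, 3, None, 2])
# mrr = (1 + 1/3 + 0 + 1/2) / 4
```

A question with no correct candidate contributes zero reciprocal rank, which is the usual convention for MRR in QA evaluations.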
of the models. As more labeled data is introduced, the Hybrid SVM models' performance increases drastically, even outperforming the state-of-the-art MRR performance on the TREC04 datasets presented in (Shen and Klakow, 2006), i.e., MRR=67.0%, Top1=62.0%, Top5=74.0%. This is due to the fact that we establish two separate entailment models for copula and non-copula q/a sentence pairs, which enables extracting useful information and a better representation of the specific data.

Experiment 3. Although SSL methods are capable of exploiting information from unlabeled data, learning becomes infeasible as the number of data points gets very large. There is various research on SSL aimed at overcoming the challenge of using large numbers of unlabeled data points (Delalleau et al., 2006). Our graph summarization method, Hybrid gSum SSL, takes a different approach: it can summarize very large datasets into representative data points and embed the original spatial information of the data points, namely the local density constraints, within the SSL summarization schema. We demonstrate that as more labeled data is used, we obtain a richer summary dataset with additional spatial information, which helps to improve the performance of the graph summary models. We gradually increase the number of unlabeled data samples, as shown in Table 3, to demonstrate the effects on the performance on the testing dataset. The results show that the number of unlabeled data points has a positive effect on the performance of graph summarization SSL.

#Unlabeled  MRR    Top1   Top5
25K         62.1%  52.0%  76.7%
50K         62.5%  52.5%  77.2%
100K        63.3%  54.0%  77.2%

Table 3: The effect of the number of unlabeled data points on MRR from Hybrid graph summarization SSL.

6 Conclusions and Discussions

In this paper, we applied a graph-based SSL algorithm to improve the performance of the QA task by exploiting unlabeled entailment relations between affirmed question and candidate sentence pairs. Our semantic and syntactic features for textual entailment analysis have individually been shown to improve the performance of the QA system compared to the baseline. We proposed a new graph representation for SSL that can represent textual entailment relations while embedding different question structures. We demonstrated that summarization on graph-based SSL can improve QA task performance when more unlabeled data is used to learn the classifier model.

There are several directions in which to improve our work: (1) The results of our graph summarization on very large unlabeled data are slightly below the best SVM results. This is largely due to using headlines instead of affirmed questions: headlines do not contain a question type, and some of them are not in proper sentence form. This adversely affects the named entity match between the question type and the candidate sentence named entities, as well as the semantic match component feature extraction. We will investigate experiment 3 by using real questions from different sources and constructing different test datasets. (2) We will use other distance measures to better explain entailment between q/a pairs and compare with other semi-supervised and transductive approaches.
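To make the transductive prediction step of Section 4.2 concrete, the following is a minimal label-spreading sketch on a similarity graph with per-node density weights in the spirit of the constrained normalized Laplacian in (12). The toy graph, the clamping scheme, and the parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def constrained_label_spread(W, y, delta, alpha=0.9, iters=200):
    """Transductive label spreading on a similarity graph W.

    delta holds per-node local density constraints: values above 1 for
    labeled summary points (1 + delta_i, cf. Eq. 11) and 1.0 for the
    unlabeled test points. They rescale the degree normalization,
    mimicking the constrained normalized Laplacian of Eq. 12."""
    d = W.sum(axis=1)                          # node degrees d_i
    s = 1.0 / np.sqrt(delta * d + 1e-12)       # 1 / sqrt(delta'_i * d_i)
    S = s[:, None] * W * s[None, :]            # constrained normalized affinity
    f = y.astype(float).copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y    # propagate, clamp toward seeds
    return f

# Toy chain graph of 4 q/a pairs; node 0 is a labeled positive seed,
# the remaining entries of y are 0.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0, 0.0])
delta = np.array([1.5, 1.2, 1.0, 1.0])        # 1 + delta_i for summary points
scores = constrained_label_spread(W, y, delta)
```

With delta identically 1 this reduces to standard label spreading; values above 1 damp the edges of dense summary nodes, so their labels spread more conservatively.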
References

Jinxiu Chen, Donghong Ji, C. Lim Tan, and Zhengyu Niu. 2006. Relation extraction using label propagation based semi-supervised learning. In Proceedings of ACL-2006.

Charles L.A. Clarke, Gordon V. Cormack, R. Thomas Lynam, and Egidio L. Terra. 2006. Question answering by passage selection. In Advances in open domain question answering, Strzalkowski, and

Horacio Saggion and Robert Gaizauskas. 2006. Experiments in passage selection and answer extraction for question answering. In Advances in natural language processing, pages 291–302. Springer.

Dan Shen and Dietrich Klakow. 2006. Exploring correlation of dependency relation paths for answer extraction. In Proceedings of ACL-2006.

Vikas Sindhwani, Wei Chu, and S. Sathiya Keerthi.