4.2 Prediction of New Testing Dataset

Instead of using the large dataset, we now use the summary dataset with predicted labels, together with local density constraints, to learn the class labels of n_te unseen data points, i.e., the testing data points X_Te = {x_1, ..., x_nte}. We apply the graph-based SSL method to the new representative dataset X^0 = X ∪ X_Te, which is comprised of the summarized dataset X = {x_i}_{i=1}^p as labeled data points and the testing dataset X_Te as unlabeled data points. Since we do not know the estimated local density constraints of the unlabeled data points, we use constants to construct the local density constraint column vector for the X^0 dataset as follows:

    \delta^0 = \{1 + \delta_i\}_{i=1}^{p} \cup [1 \cdots 1]^T \in \Re^{n_{te}}    (11)

with 0 < δ_i ≤ 1. To embed the local density constraints, the second term in (3) is replaced with the constrained normalized Laplacian, L_c = δ^T L δ:

    \sum_{i,j \in L \cup T} w_{ij} \left( \frac{f_i}{\sqrt{\delta^0_i \ast d_i}} - \frac{f_j}{\sqrt{\delta^0_j \ast d_j}} \right)^2 = f^T L_c f    (12)

5 Experiments

ated labeled training dataset X_L of around 7600 q/a pairs. We evaluated the performance of the graph-based QA system using a set of 202 questions from TREC04 as the testing dataset (Voorhees, 2003; Prager et al., 2000). We retrieved around 20 candidate sentences for each of the 202 test questions and manually labeled each q/a pair as true/false entailment to compile 4037 test data points.

To obtain more unlabeled training data X_U, we extracted around 100,000 document headlines from a large newswire corpus. Instead of matching the headline and the first sentence of the document as in (Harabagiu and Hickl, 2006), we followed a different approach. Using each headline as a query, we retrieved around 20 top-ranked sentences from a search engine. For each headline, we picked the 1st and the 20th retrieved sentences. Our assumption is that the first retrieved sentence may have a higher probability of entailing the headline, whereas the last one may have a lower probability. Each of these headline-candidate sentence pairs is used as an additional unlabeled q/a pair.
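The identity in (12) can be checked numerically on a toy graph. The sketch below, under our own assumptions (a small random symmetric affinity matrix, two summary points and two test points), builds the density vector of (11) and a constrained Laplacian L_c = S (D − W) S with S = diag(1/sqrt(δ⁰_i d_i)); with this convention the double sum over ordered pairs equals twice the quadratic form, since each edge is counted in both directions.

```python
import numpy as np

# Toy check of the constrained normalized Laplacian in (12).
# Assumptions (ours, not from the paper): a tiny symmetric affinity
# matrix W, p = 2 summary points with densities delta_i, n_te = 2
# unlabeled testing points.
rng = np.random.default_rng(0)

p, n_te = 2, 2
n = p + n_te

# Symmetric non-negative weights with zero diagonal.
A = rng.random((n, n))
W = np.triu(A, 1) + np.triu(A, 1).T

d = W.sum(axis=1)                      # node degrees d_i

# Local density constraint vector (11): 1 + delta_i for summary
# points, constant 1 for the unlabeled testing points.
delta = rng.uniform(0.1, 1.0, size=p)  # 0 < delta_i <= 1
delta0 = np.concatenate([1.0 + delta, np.ones(n_te)])

# Constrained normalized Laplacian: L_c = S (D - W) S with
# S = diag(1 / sqrt(delta0_i * d_i)).
S = np.diag(1.0 / np.sqrt(delta0 * d))
L_c = S @ (np.diag(d) - W) @ S

f = rng.standard_normal(n)             # arbitrary label scores

quad = f @ L_c @ f
g = f / np.sqrt(delta0 * d)
# Ordered pairs count each edge twice, hence the factor of 2 below;
# the paper's (12) may fold this constant into L_c itself.
double_sum = sum(W[i, j] * (g[i] - g[j]) ** 2
                 for i in range(n) for j in range(n))

assert np.isclose(2 * quad, double_sum)
```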

Since each headline represents a converted question, in order to extract the question-type feature we use a matching NER type between the headline and the candidate sentence to set the question-type NER match feature.

We applied the pre-processing and feature extraction steps of section 2 to compile the labeled and unlabeled training datasets and the labeled testing dataset. We use the rank scores obtained from the search engine as the baseline of our system. We present the performance of the models using Mean Reciprocal Rank (MRR), top 1 (Top1) and top 5 (Top5) prediction accuracies, as they are the most commonly used performance measures of QA systems (Voorhees, 2004). We performed manual iterative parameter optimization during training based on prediction accuracy to find the best k-nearest parameter for SSL, i.e., k = {3, 5, 10, 20, 50}, and the best C = 10^-2, ..., 10^2 and γ = 2^-2, ..., 2^3 for the RBF kernel SVM. Next we describe three different experiments and present individual results.

Graph summarization makes it feasible to execute SSL on very large unlabeled datasets, which was otherwise impossible. This paper makes no assumptions on the performance of the method in comparison to other SSL methods.

Experiment 1. Here we test the individual contribution of each set of features to our QA system. We applied SVM and our graph-based SSL method with no summarization to learn models using the labeled training and testing datasets. For SSL we used the training set as labeled and the testing set as unlabeled data, in a transductive way, to predict the entailment scores. The results are shown in Table 1. From section 2.2, QTCF represents the question-type NER match feature, LexSem is the bundle of lexico-semantic features, and QComp comprises the matching features of subject, head, object, and three complements. In comparison to the baseline, QComp has a significant effect on the accuracy of the QA system. In addition, QTCF has been shown to improve the MRR performance by about 22%. Although the LexSem features have minimal semantic properties, they can improve MRR performance by 14%.

Features   Model   MRR     Top1    Top5
Baseline   −       42.3%   32.7%   54.5%
QTCF       SVM     51.9%   44.6%   63.4%
QTCF       SSL     49.5%   43.1%   60.9%
LexSem     SVM     48.2%   40.6%   61.4%
LexSem     SSL     47.9%   40.1%   58.4%
QComp      SVM     54.2%   47.5%   64.3%
QComp      SSL     51.9%   45.5%   62.4%

Table 1: MRR for different features and methods.

Experiment 2. To evaluate the performance of graph summarization, we performed two separate experiments. In the first part, we randomly selected subsets X_L^i ⊂ X_L of the labeled training dataset with different sample sizes, n_L^i = {1% ∗ n_L, 5% ∗ n_L, 10% ∗ n_L, 25% ∗ n_L, 50% ∗ n_L, 100% ∗ n_L}, where n_L represents the sample size of X_L. At each random selection, the rest of the labeled dataset is hypothetically used as unlabeled data to verify the performance of our SSL with different sizes of labeled data. Table 2 reports the MRR performance of the QA system on the testing dataset using SVM and our graph-summary SSL (gSum SSL) method with the similarity function in (1). In the second part of the experiment, we applied graph summarization to copula and non-copula questions separately and merged the obtained representative points to create a labeled summary dataset. Then, using the similarity function in (2), we applied SSL to the labeled summary and unlabeled testing data via transduction. We call these models Hybrid gSum SSL. To build SVM models in the same way, we separated the training dataset into two parts based on copula and non-copula questions, X_cp and X_ncp, and re-ran the SVM method separately. The testing dataset is divided into two accordingly. Models learned from copula sentence datasets are applied to copula sentences of the testing dataset, and vice versa for non-copula sentences. The predicted scores are combined to measure the overall performance of the Hybrid SVM models. We repeated the experiments five times with different random samples and averaged the results.

Note from Table 2 that when the number of labeled data points is small (n_L^i < 10% ∗ n_L), graph-based SSL, gSum SSL, performs better than SVM. As the percentage of labeled points in the training data increases, SVM performance increases; however, graph-summary SSL remains comparable with SVM. On the other hand, when we build separate models for copula and non-copula questions with different features, the performance of the overall model increases significantly for both methods. Especially in Hybrid graph-summary SSL (Hybrid gSum SSL), when the number of labeled data points is small (n_L^i < 25% ∗ n_L) the performance improvement is better than for the rest of the models. As more labeled data is introduced, the Hybrid SVM models' performance increases drastically, even outperforming the state-of-the-art MRR performance on the TREC04 dataset presented in (Shen and Klakow, 2006), i.e., MRR=67.0%, Top1=62.0%, Top5=74.0%. This is due to the fact that we establish two separate entailment models for copula and non-copula q/a sentence pairs, which enables extracting useful information and a better representation of the specific data.
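The MRR, Top1 and Top5 measures reported in the tables can be computed directly from the ranked candidate lists. A minimal sketch (function and variable names are ours, not the authors'): for each question, MRR takes the reciprocal rank of the first correct candidate, while Top-k asks whether any correct candidate appears in the first k positions.

```python
def qa_metrics(rankings, k=5):
    """Compute MRR, Top1 and Top-k accuracy over a set of questions.

    rankings: one list per question, each a ranked list of booleans
    (True = the candidate sentence entails / answers the question).
    Questions with no correct candidate contribute 0 to every measure.
    """
    rr = top1 = topk = 0.0
    for ranked in rankings:
        hits = [i for i, correct in enumerate(ranked) if correct]
        if hits:
            rr += 1.0 / (hits[0] + 1)          # reciprocal rank of first hit
            top1 += 1.0 if hits[0] == 0 else 0.0
            topk += 1.0 if hits[0] < k else 0.0
    n = len(rankings)
    return rr / n, top1 / n, topk / n

# Example: three questions with ranked candidate lists.
mrr, top1, top5 = qa_metrics([
    [False, True, False],   # first correct answer at rank 2 -> RR = 1/2
    [True, False],          # rank 1 -> RR = 1
    [False, False, False],  # no correct answer -> RR = 0
])
# mrr = 0.5, top1 = 1/3, top5 = 2/3
```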

               SVM                    gSum SSL               Hybrid SVM             Hybrid gSum SSL
#Labeled (%)   MRR    Top1   Top5     MRR    Top1   Top5     MRR    Top1   Top5     MRR    Top1   Top5
1%             45.2   33.2   65.8     56.1   44.6   72.8     51.6   40.1   70.8     59.7   47.0   75.2
5%             56.5   45.1   73.0     57.3   46.0   73.7     54.2   40.6   72.3     60.3   48.5   76.7
10%            59.3   47.5   76.7     57.9   46.5   74.2     57.7   47.0   74.2     60.4   48.5   77.2
25%            59.8   49.0   78.7     58.4   45.0   79.2     61.4   49.5   78.2     60.6   49.0   76.7
50%            60.9   48.0   80.7     58.9   45.5   79.2     62.2   51.0   79.7     61.3   50.0   77.2
100%           63.5   55.4   77.7     59.7   47.5   79.7     67.6   58.0   82.2     61.9   51.5   78.2

Table 2: The MRR (%) results of graph-summary SSL (gSum SSL) and SVM, as well as Hybrid gSum SSL and Hybrid SVM, with different sizes of labeled data.
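The hybrid evaluation behind the last two column groups of Table 2 routes each test pair to the model trained on its question type and then pools the scores. A schematic version (the `is_copula` flag and the scorer callables are placeholders standing in for the trained SVM or gSum SSL predictors, not the paper's implementation):

```python
def hybrid_scores(test_items, score_cp, score_ncp):
    """Score each q/a pair with the model matching its question type,
    keeping the original order so pooled scores stay aligned with
    the test items for overall MRR/Top-k computation.

    test_items: list of (features, is_copula) tuples.
    score_cp / score_ncp: scoring functions of the copula and
    non-copula models (placeholders).
    """
    return [score_cp(x) if is_cop else score_ncp(x)
            for x, is_cop in test_items]

# Toy usage with dummy scorers standing in for the trained models.
items = [([0.25], True), ([0.75], False), ([0.5], True)]
scores = hybrid_scores(items,
                       score_cp=lambda x: x[0] + 1.0,
                       score_ncp=lambda x: x[0] - 1.0)
# scores == [1.25, -0.25, 1.5]
```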

#Unlabeled   MRR     Top1    Top5
25K          62.1%   52.0%   76.7%
50K          62.5%   52.5%   77.2%
100K         63.3%   54.0%   77.2%

Table 3: The effect of the number of unlabeled data points on MRR for Hybrid graph-summarization SSL.
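One way to picture the summarization step that makes the unlabeled-data scaling in Table 3 feasible is a clustering pass that replaces many points with a few representatives plus local density information. The sketch below uses a plain k-means-style loop as a simplification (our own, not the paper's summarization algorithm); the fraction of points each representative absorbs plays the role of its local density value.

```python
import numpy as np

def summarize(points, centers, iters=20):
    """Replace a dataset by k representative points plus a local
    density value per representative (fraction of points it absorbs).
    Plain k-means iteration; a simplification of graph summarization.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(iters):
        # Assign every point to its nearest representative.
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        # Move each representative to the mean of its points.
        for k in range(len(centers)):
            if (assign == k).any():
                centers[k] = points[assign == k].mean(axis=0)
    density = np.bincount(assign, minlength=len(centers)) / len(points)
    return centers, density

# Two well-separated groups of unlabeled points.
pts = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]]
reps, dens = summarize(pts, centers=[[0.0, 0.0], [5.0, 5.0]])
# dens == [3/7, 4/7]: each representative carries its group's density.
```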

Experiment 3. Although SSL methods are capable of exploiting information from unlabeled data, learning becomes infeasible as the number of data points gets very large. There is various research on SSL that addresses the challenge of using very large numbers of unlabeled data points (Delalleau et al., 2006). Our graph summarization method, Hybrid gSum SSL, takes a different approach: it can summarize very large datasets into representative data points and embed the original spatial information of the data points, namely the local density constraints, within the SSL summarization schema. We demonstrate that as more labeled data is used, we obtain a richer summary dataset with additional spatial information, which helps to improve the performance of the graph-summary models. We gradually increase the number of unlabeled data samples, as shown in Table 3, to demonstrate the effects on the performance on the testing dataset. The results show that the number of unlabeled data points has a positive effect on the performance of graph-summarization SSL.

6 Conclusions and Discussions

In this paper, we applied a graph-based SSL algorithm to improve the performance of the QA task by exploiting unlabeled entailment relations between affirmed question and candidate sentence pairs. Our semantic and syntactic features for textual entailment analysis have individually been shown to improve the performance of the QA system compared to the baseline. We proposed a new graph representation for SSL that can represent textual entailment relations while embedding different question structures. We demonstrated that summarization in graph-based SSL can improve QA task performance when more unlabeled data is used to learn the classifier model.

There are several directions in which to improve our work: (1) The results of our graph summarization on very large unlabeled data are slightly below the best SVM results. This is largely due to using headlines instead of affirmed questions: headlines do not contain a question type, and some of them are not in proper sentence form. This adversely affects the named-entity match between the question type and the candidate sentence named entities, as well as the semantic match component feature extraction. We will investigate experiment 3 using real questions from different sources and construct different test datasets. (2) We will use other distance measures to better explain entailment between q/a pairs and compare with other semi-supervised and transductive approaches.
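The transductive step used throughout the experiments, spreading entailment labels from labeled to unlabeled nodes over a similarity graph, can be sketched with the standard normalized-graph label-spreading update of Zhou et al. (a generic version under our own assumptions, not the authors' exact solver; the graph here is a toy one).

```python
import numpy as np

def propagate(W, y, alpha=0.9, iters=200):
    """Graph-based SSL by iterating f <- alpha * S f + (1 - alpha) * y,
    where S = D^{-1/2} W D^{-1/2} (Zhou et al.-style label spreading).

    W: symmetric affinity matrix; y: initial labels in {-1, 0, +1},
    with 0 marking unlabeled nodes.
    """
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))     # normalized affinities
    f = y.astype(float).copy()
    for _ in range(iters):
        f = alpha * (S @ f) + (1 - alpha) * y
    return np.sign(f)

# Toy graph: two 3-node cliques joined by one weak edge; one labeled
# node per clique (node 0: +1, node 5: -1).
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, .1, 0, 0],
              [0, 0, .1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
y = np.array([1, 0, 0, 0, 0, -1], dtype=float)
labels = propagate(W, y)
# labels -> [1, 1, 1, -1, -1, -1]: each clique takes its seed's label.
```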

References

Jinxiu Chen, Donghong Ji, C. Lim Tan, and Zhengyu Niu. 2006. Relation extraction using label propagation based semi-supervised learning. In Proceedings of ACL-2006.

Horacio Saggion and Robert Gaizauskas. 2006. Experiments in passage selection and answer extraction for question answering. In Advances in natural language processing, pages 291–302. Springer.

Dan Shen and Dietrich Klakow. 2006. Exploring correlation of dependency relation paths for answer extraction. In Proceedings of ACL-2006.

Vikas Sindhwani, Wei Chu, and S. Sathiya Keerthi.

Charles L.A. Clarke, Gordon V. Cormack, R. Thomas Lynam, and Egidio L. Terra. 2006. Question answering by passage selection. In Advances in open domain question answering, Strzalkowski, and