4.2 Prediction of New Testing Dataset

Instead of using the large dataset, we now use the summary dataset with predicted labels, together with local density constraints, to learn the class labels of n_te unseen data points, i.e., the testing data points X_Te = {x_1, ..., x_nte}. We apply the graph-based SSL method to the new representative dataset X^0 = X ∪ X_Te, which is comprised of the summarized dataset X = {x_i}_{i=1}^p as labeled data points and the testing dataset X_Te as unlabeled data points. Since we do not know the estimated local density constraints of the unlabeled data points, we use constants to construct the local density constraint column vector for the X^0 dataset as follows:

    \delta^0 = \{1 + \delta_i\}_{i=1}^{p} \cup [1 \cdots 1]^T \in \Re^{n_{te}}    (11)

with 0 < δ_i ≤ 1. To embed the local density constraints, the second term in (3) is replaced with the constrained normalized Laplacian, L_c = δ^T L δ:

    \sum_{i,j \in L \cup T} w_{ij} \left( \frac{f_i}{\sqrt{\delta^0_i \ast d_i}} - \frac{f_j}{\sqrt{\delta^0_j \ast d_j}} \right)^2 = f^T L_c f    (12)

5 Experiments

ated labeled training dataset X_L of around 7600 q/a pairs. We evaluated the performance of the graph-based QA system using a set of 202 questions from TREC04 as the testing dataset (Voorhees, 2003; Prager et al., 2000). We retrieved around 20 candidate sentences for each of the 202 test questions and manually labeled each q/a pair as true/false entailment to compile 4037 test data points.

To obtain more unlabeled training data X_U, we extracted around 100,000 document headlines from a large newswire corpus. Instead of matching the headline and the first sentence of the document as in (Harabagiu and Hickl, 2006), we followed a different approach. Using each headline as a query, we retrieved around 20 top-ranked sentences from a search engine. For each headline, we picked the 1st and the 20th retrieved sentences. Our assumption is that the first retrieved sentence may have a higher probability of entailing the headline, whereas the last one may have a lower probability. Each of these headline-candidate sentence pairs is used as an additional unlabeled q/a pair.
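The identity in (12) can be checked numerically on a toy graph. The sketch below, under our own assumptions (a small random symmetric affinity matrix, two summary points and two test points), builds the density vector of (11) and a constrained Laplacian L_c = S (D − W) S with S = diag(1/sqrt(δ⁰_i d_i)); with this convention the double sum over ordered pairs equals twice the quadratic form, since each edge is counted in both directions.

```python
import numpy as np

# Toy check of the constrained normalized Laplacian in (12).
# Assumptions (ours, not from the paper): a tiny symmetric affinity
# matrix W, p = 2 summary points with densities delta_i, n_te = 2
# unlabeled testing points.
rng = np.random.default_rng(0)

p, n_te = 2, 2
n = p + n_te

# Symmetric non-negative weights with zero diagonal.
A = rng.random((n, n))
W = np.triu(A, 1) + np.triu(A, 1).T

d = W.sum(axis=1)                      # node degrees d_i

# Local density constraint vector (11): 1 + delta_i for summary
# points, constant 1 for the unlabeled testing points.
delta = rng.uniform(0.1, 1.0, size=p)  # 0 < delta_i <= 1
delta0 = np.concatenate([1.0 + delta, np.ones(n_te)])

# Constrained normalized Laplacian: L_c = S (D - W) S with
# S = diag(1 / sqrt(delta0_i * d_i)).
S = np.diag(1.0 / np.sqrt(delta0 * d))
L_c = S @ (np.diag(d) - W) @ S

f = rng.standard_normal(n)             # arbitrary label scores

quad = f @ L_c @ f
g = f / np.sqrt(delta0 * d)
# Ordered pairs count each edge twice, hence the factor of 2 below;
# the paper's (12) may fold this constant into L_c itself.
double_sum = sum(W[i, j] * (g[i] - g[j]) ** 2
                 for i in range(n) for j in range(n))

assert np.isclose(2 * quad, double_sum)
```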

Since each headline represents a converted question, in order to extract the question-type feature we use a matching NER type between the headline and the candidate sentence to set the question-type NER match feature.

We applied the pre-processing and feature extraction steps of section 2 to compile the labeled and unlabeled training datasets and the labeled testing dataset. We use the rank scores obtained from the search engine as the baseline of our system. We present the performance of the models using Mean Reciprocal Rank (MRR), top 1 (Top1) and top 5 (Top5) prediction accuracies, as they are the most commonly used performance measures of QA systems (Voorhees, 2004). We performed manual iterative parameter optimization during training based on prediction accuracy to find the best k-nearest parameter for SSL, i.e., k = {3, 5, 10, 20, 50}, and the best C = 10^-2, ..., 10^2 and γ = 2^-2, ..., 2^3 for the RBF kernel SVM. Next we describe three different experiments and present individual results.

Graph summarization makes it feasible to execute SSL on very large unlabeled datasets, which was otherwise impossible. This paper makes no assumptions on the performance of the method in comparison to other SSL methods.

Experiment 1. Here we test the individual contribution of each set of features to our QA system. We applied SVM and our graph-based SSL method with no summarization to learn models using the labeled training and testing datasets. For SSL we used the training set as labeled and the testing set as unlabeled data, in a transductive way, to predict the entailment scores. The results are shown in Table 1. From section 2.2, QTCF represents the question-type NER match feature, LexSem is the bundle of lexico-semantic features, and QComp comprises the matching features of subject, head, object, and three complements. In comparison to the baseline, QComp has a significant effect on the accuracy of the QA system. In addition, QTCF has been shown to improve the MRR performance by about 22%. Although the LexSem features have minimal semantic properties, they can improve MRR performance by 14%.

Features   Model   MRR     Top1    Top5
Baseline   −       42.3%   32.7%   54.5%
QTCF       SVM     51.9%   44.6%   63.4%
QTCF       SSL     49.5%   43.1%   60.9%
LexSem     SVM     48.2%   40.6%   61.4%
LexSem     SSL     47.9%   40.1%   58.4%
QComp      SVM     54.2%   47.5%   64.3%
QComp      SSL     51.9%   45.5%   62.4%

Table 1: MRR for different features and methods.

Experiment 2. To evaluate the performance of graph summarization, we performed two separate experiments. In the first part, we randomly selected subsets X_L^i ⊂ X_L of the labeled training dataset with different sample sizes, n_L^i = {1% ∗ n_L, 5% ∗ n_L, 10% ∗ n_L, 25% ∗ n_L, 50% ∗ n_L, 100% ∗ n_L}, where n_L represents the sample size of X_L. At each random selection, the rest of the labeled dataset is hypothetically used as unlabeled data to verify the performance of our SSL with different sizes of labeled data. Table 2 reports the MRR performance of the QA system on the testing dataset using SVM and our graph-summary SSL (gSum SSL) method with the similarity function in (1). In the second part of the experiment, we applied graph summarization to copula and non-copula questions separately and merged the obtained representative points to create a labeled summary dataset. Then, using the similarity function in (2), we applied SSL to the labeled summary and unlabeled testing data via transduction. We call these models Hybrid gSum SSL. To build SVM models in the same way, we separated the training dataset into two parts based on copula and non-copula questions, X_cp and X_ncp, and re-ran the SVM method separately. The testing dataset is divided into two accordingly. Models learned from copula sentence datasets are applied to copula sentences of the testing dataset, and vice versa for non-copula sentences. The predicted scores are combined to measure the overall performance of the Hybrid SVM models. We repeated the experiments five times with different random samples and averaged the results.

Note from Table 2 that when the number of labeled data points is small (n_L^i < 10% ∗ n_L), graph-based SSL, gSum SSL, performs better than SVM. As the percentage of labeled points in the training data increases, SVM performance increases; however, graph-summary SSL remains comparable with SVM. On the other hand, when we build separate models for copula and non-copula questions with different features, the performance of the overall model increases significantly for both methods. Especially in Hybrid graph-summary SSL (Hybrid gSum SSL), when the number of labeled data points is small (n_L^i < 25% ∗ n_L) the performance improvement is better than for the rest of the models. As more labeled data is introduced, the Hybrid SVM models' performance increases drastically, even outperforming the state-of-the-art MRR performance on the TREC04 dataset presented in (Shen and Klakow, 2006), i.e., MRR=67.0%, Top1=62.0%, Top5=74.0%. This is due to the fact that we establish two separate entailment models for copula and non-copula q/a sentence pairs, which enables extracting useful information and a better representation of the specific data.
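The MRR, Top1 and Top5 measures reported in the tables can be computed directly from the ranked candidate lists. A minimal sketch (function and variable names are ours, not the authors'): for each question, MRR takes the reciprocal rank of the first correct candidate, while Top-k asks whether any correct candidate appears in the first k positions.

```python
def qa_metrics(rankings, k=5):
    """Compute MRR, Top1 and Top-k accuracy over a set of questions.

    rankings: one list per question, each a ranked list of booleans
    (True = the candidate sentence entails / answers the question).
    Questions with no correct candidate contribute 0 to every measure.
    """
    rr = top1 = topk = 0.0
    for ranked in rankings:
        hits = [i for i, correct in enumerate(ranked) if correct]
        if hits:
            rr += 1.0 / (hits[0] + 1)          # reciprocal rank of first hit
            top1 += 1.0 if hits[0] == 0 else 0.0
            topk += 1.0 if hits[0] < k else 0.0
    n = len(rankings)
    return rr / n, top1 / n, topk / n

# Example: three questions with ranked candidate lists.
mrr, top1, top5 = qa_metrics([
    [False, True, False],   # first correct answer at rank 2 -> RR = 1/2
    [True, False],          # rank 1 -> RR = 1
    [False, False, False],  # no correct answer -> RR = 0
])
# mrr = 0.5, top1 = 1/3, top5 = 2/3
```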

               SVM                    gSum SSL               Hybrid SVM             Hybrid gSum SSL
#Labeled (%)   MRR    Top1   Top5     MRR    Top1   Top5     MRR    Top1   Top5     MRR    Top1   Top5
1%             45.2   33.2   65.8     56.1   44.6   72.8     51.6   40.1   70.8     59.7   47.0   75.2
5%             56.5   45.1   73.0     57.3   46.0   73.7     54.2   40.6   72.3     60.3   48.5   76.7
10%            59.3   47.5   76.7     57.9   46.5   74.2     57.7   47.0   74.2     60.4   48.5   77.2
25%            59.8   49.0   78.7     58.4   45.0   79.2     61.4   49.5   78.2     60.6   49.0   76.7
50%            60.9   48.0   80.7     58.9   45.5   79.2     62.2   51.0   79.7     61.3   50.0   77.2
100%           63.5   55.4   77.7     59.7   47.5   79.7     67.6   58.0   82.2     61.9   51.5   78.2

Table 2: The MRR (%) results of graph-summary SSL (gSum SSL) and SVM, as well as Hybrid gSum SSL and Hybrid SVM, with different sizes of labeled data.
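The hybrid evaluation behind the last two column groups of Table 2 routes each test pair to the model trained on its question type and then pools the scores. A schematic version (the `is_copula` flag and the scorer callables are placeholders standing in for the trained SVM or gSum SSL predictors, not the paper's implementation):

```python
def hybrid_scores(test_items, score_cp, score_ncp):
    """Score each q/a pair with the model matching its question type,
    keeping the original order so pooled scores stay aligned with
    the test items for overall MRR/Top-k computation.

    test_items: list of (features, is_copula) tuples.
    score_cp / score_ncp: scoring functions of the copula and
    non-copula models (placeholders).
    """
    return [score_cp(x) if is_cop else score_ncp(x)
            for x, is_cop in test_items]

# Toy usage with dummy scorers standing in for the trained models.
items = [([0.25], True), ([0.75], False), ([0.5], True)]
scores = hybrid_scores(items,
                       score_cp=lambda x: x[0] + 1.0,
                       score_ncp=lambda x: x[0] - 1.0)
# scores == [1.25, -0.25, 1.5]
```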

#Unlabeled   MRR     Top1    Top5
25K          62.1%   52.0%   76.7%
50K          62.5%   52.5%   77.2%
100K         63.3%   54.0%   77.2%

Table 3: The effect of the number of unlabeled data points on MRR for Hybrid graph-summarization SSL.
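One way to picture the summarization step that makes the unlabeled-data scaling in Table 3 feasible is a clustering pass that replaces many points with a few representatives plus local density information. The sketch below uses a plain k-means-style loop as a simplification (our own, not the paper's summarization algorithm); the fraction of points each representative absorbs plays the role of its local density value.

```python
import numpy as np

def summarize(points, centers, iters=20):
    """Replace a dataset by k representative points plus a local
    density value per representative (fraction of points it absorbs).
    Plain k-means iteration; a simplification of graph summarization.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(iters):
        # Assign every point to its nearest representative.
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        # Move each representative to the mean of its points.
        for k in range(len(centers)):
            if (assign == k).any():
                centers[k] = points[assign == k].mean(axis=0)
    density = np.bincount(assign, minlength=len(centers)) / len(points)
    return centers, density

# Two well-separated groups of unlabeled points.
pts = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]]
reps, dens = summarize(pts, centers=[[0.0, 0.0], [5.0, 5.0]])
# dens == [3/7, 4/7]: each representative carries its group's density.
```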

Experiment 3. Although SSL methods are capable of exploiting information from unlabeled data, learning becomes infeasible as the number of data points gets very large. There is various research on SSL that addresses the challenge of using very large numbers of unlabeled data points (Delalleau et al., 2006). Our graph summarization method, Hybrid gSum SSL, takes a different approach: it can summarize very large datasets into representative data points and embed the original spatial information of the data points, namely the local density constraints, within the SSL summarization schema. We demonstrate that as more labeled data is used, we obtain a richer summary dataset with additional spatial information, which helps to improve the performance of the graph-summary models. We gradually increase the number of unlabeled data samples, as shown in Table 3, to demonstrate the effects on the performance on the testing dataset. The results show that the number of unlabeled data points has a positive effect on the performance of graph-summarization SSL.

6 Conclusions and Discussions

In this paper, we applied a graph-based SSL algorithm to improve the performance of the QA task by exploiting unlabeled entailment relations between affirmed question and candidate sentence pairs. Our semantic and syntactic features for textual entailment analysis have individually been shown to improve the performance of the QA system compared to the baseline. We proposed a new graph representation for SSL that can represent textual entailment relations while embedding different question structures. We demonstrated that summarization in graph-based SSL can improve QA task performance when more unlabeled data is used to learn the classifier model.

There are several directions in which to improve our work: (1) The results of our graph summarization on very large unlabeled data are slightly below the best SVM results. This is largely due to using headlines instead of affirmed questions: headlines do not contain a question type, and some of them are not in proper sentence form. This adversely affects the named-entity match between the question type and the candidate sentence named entities, as well as the semantic match component feature extraction. We will investigate experiment 3 using real questions from different sources and construct different test datasets. (2) We will use other distance measures to better explain entailment between q/a pairs and compare with other semi-supervised and transductive approaches.
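The transductive step used throughout the experiments, spreading entailment labels from labeled to unlabeled nodes over a similarity graph, can be sketched with the standard normalized-graph label-spreading update of Zhou et al. (a generic version under our own assumptions, not the authors' exact solver; the graph here is a toy one).

```python
import numpy as np

def propagate(W, y, alpha=0.9, iters=200):
    """Graph-based SSL by iterating f <- alpha * S f + (1 - alpha) * y,
    where S = D^{-1/2} W D^{-1/2} (Zhou et al.-style label spreading).

    W: symmetric affinity matrix; y: initial labels in {-1, 0, +1},
    with 0 marking unlabeled nodes.
    """
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))     # normalized affinities
    f = y.astype(float).copy()
    for _ in range(iters):
        f = alpha * (S @ f) + (1 - alpha) * y
    return np.sign(f)

# Toy graph: two 3-node cliques joined by one weak edge; one labeled
# node per clique (node 0: +1, node 5: -1).
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, .1, 0, 0],
              [0, 0, .1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
y = np.array([1, 0, 0, 0, 0, -1], dtype=float)
labels = propagate(W, y)
# labels -> [1, 1, 1, -1, -1, -1]: each clique takes its seed's label.
```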

References

Jinxiu Chen, Donghong Ji, C. Lim Tan, and Zhengyu Niu. 2006. Relation extraction using label propagation based semi-supervised learning. In Proceedings of ACL-2006.

Horacio Saggion and Robert Gaizauskas. 2006. Experiments in passage selection and answer extraction for question answering. In Advances in natural language processing, pages 291–302. Springer.

Dan Shen and Dietrich Klakow. 2006. Exploring correlation of dependency relation paths for answer extraction. In Proceedings of ACL-2006.

Vikas Sindhwani, Wei Chu, and S. Sathiya Keerthi.

Charles L.A. Clarke, Gordon V. Cormack, R. Thomas Lynam, and Egidio L. Terra. 2006. Question answering by passage selection. In Advances in open domain question answering, Strzalkowski, and