information between them.

2.2 Features from Paired Sentence Analysis

We extract the TE features based on the above lexical, syntactic and semantic analysis of q/a pairs and cast the QA task as a classification problem. Among the many syntactic and semantic features we considered, here we present only the major ones:

(1) (QTCF) Question-Type–Candidate Sentence NER match feature: takes the value '1' when the candidate sentence contains the fine NER of the question-type, '0.5' if it contains the coarse NER, and '0' if no NER match is found.

(2) (QComp) Question component match features: the sentence component analysis is applied to both the affirmed question and the candidate sentence of each pair to characterize their semantic components, including subject (S), object (O), head (H) and modifiers (M). We match each semantic component of a question to the best matching component of the candidate sentence.

3 Graph Based Semi-Supervised Learning for Entailment Ranking

We formulate semi-supervised entailment rank scores as follows. Let each data point $x_i \in \mathbb{R}^d$ in $X = \{x_1, \ldots, x_n\}$ represent information about a question and candidate sentence pair, and let $Y = \{y_1, \ldots, y_n\}$ be their output labels. The labeled part of $X$ is represented by $X_L = \{x_1, \ldots, x_l\}$ with associated labels $Y_L = \{y_1, \ldots, y_l\}^T$. For ease of presentation we concentrate on binary classification, where $y_i$ can take on either of $\{-1, +1\}$, representing entailment or non-entailment. $X$ also has an unlabeled part, $X_U = \{x_1, \ldots, x_u\}$, i.e., $X = X_L \cup X_U$. The aim is to predict labels for $X_U$. There are also other testing points, $X_{Te}$, which have the same properties as $X$.

Each node in the graph $g = (V, E)$ represents a feature vector $x_i \in \mathbb{R}^d$ of a q/a pair, characterizing their entailment relation information. When all components of a hypothesis (affirmative question) have high similarity with components of the text (candidate sentence), the entailment score between them will be high. Another pair of q/a sentences with similar structures will likewise have a high entailment score. So the similarity between two q/a pairs $x_i$ and $x_j$ is represented by the edge weight $w_{ij}$ of the weight matrix $W \in \mathbb{R}^{n \times n}$, and is measured as:

$$w_{ij} = 1 - \sum_{q=1}^{d} \frac{|x_{iq} - x_{jq}|}{d} \quad (1)$$

The closer the total entailment scores of two pairs are, the larger their edge weight will be. Based on our sentence structure analysis in section 2, the given dataset can be further separated into two parts: $X_{cp}$, containing q/a pairs whose affirmed questions are copula-type, and $X_{ncp}$, containing q/a pairs with non-copula-type affirmed questions. Since copula and non-copula sentences have different structures, e.g., copula sentences do not usually have objects, we used different sets of features for each type. Thus, we modify the edge weights in (1) as follows:

$$\tilde{w}_{ij} = \begin{cases} 0 & x_i \in X_{cp},\ x_j \in X_{ncp} \\ 1 - \sum_{q=1}^{d_{cp}} \frac{|x_{iq} - x_{jq}|}{d_{cp}} & x_i, x_j \in X_{cp} \\ 1 - \sum_{q=1}^{d_{ncp}} \frac{|x_{iq} - x_{jq}|}{d_{ncp}} & x_i, x_j \in X_{ncp} \end{cases} \quad (2)$$

1 One option would have been to leave out the non-copula questions and build the model for only copula questions.

The diagonal degree matrix $D$ is defined for graph $g$ by $D_{ii} = \sum_j \tilde{w}_{ij}$. In general graph-based SSL, a function over the graph is estimated such that it satisfies two conditions: 1) it is close to the observed labels, and 2) it is smooth on the whole graph:

$$\arg\min_f \sum_{i \in L} (f_i - y_i)^2 + \lambda \sum_{i,j \in L \cup U} w_{ij} (f_i - f_j)^2 \quad (3)$$

The second term is a regularizer representing the label smoothness, $f^T L f$, where $L = D - W$ is the graph Laplacian.

Most graph-based SSL methods are transductive, i.e., not easily extendable to new test points outside $L \cup U$. In (Delalleau et al., 2005) an induction scheme is proposed to classify a new point $x_{Te}$ by

$$\hat{f}(x_{Te}) = \frac{\sum_{i \in L \cup U} w_{x_{Te} i} f_i}{\sum_{i \in L \cup U} w_{x_{Te} i}} \quad (6)$$

Thus, we use induction, where we can, to avoid re-constructing the graph for new test points.

4 Graph Summarization

Research on graph-based SSL algorithms points out their effectiveness in real applications, e.g., (Zhu et al., 2003), (Zhou and Schölkopf, 2004), (Sindhwani et al., 2007). However, there is still a need for fast and efficient SSL methods that can deal with vast amounts of data to extract useful information. It was shown in (Delalleau et al., 2006) that the convergence rate of the propagation algorithms of SSL methods is $O(kn^2)$, which mainly depends on the form of the eigenvectors of the graph Laplacian ($k$ is the number of nearest neighbors). As the weight matrix gets denser, meaning there are more data points with connected weighted edges, it takes longer to learn the classifier function via the graph. Thus, the question is: how can one reduce the number of data points so that the weight matrix is sparse and learning takes less time?

Our idea of summarization is to create representative vertices for data points that are very close to each other in terms of edge weights. Suffice it to say that similar data points are likely to represent denser regions in the hyper-space and are likely to have the same labels. If these points are close enough, we can characterize the boundaries of these groups of similar data points with respect to the graph and then capture their summary information with new representative vertices. We replace each data point within a boundary with its representative vertex to form a summary graph.
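The three-valued QTCF feature can be sketched as a small scoring function. This is a hypothetical sketch: the parameter names and the set-based NER matching are illustrative assumptions, not the paper's implementation.

```python
def qtcf(fine_ner, coarse_ner, sentence_ners):
    # QTCF: '1' if the candidate sentence contains the fine NER of the
    # question-type, '0.5' if it contains the coarse NER, '0' otherwise.
    if fine_ner in sentence_ners:
        return 1.0
    if coarse_ner in sentence_ners:
        return 0.5
    return 0.0
```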
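The induction scheme of Eq. (6) is a weight-normalized average of the learned scores f_i over L ∪ U, so it needs no graph reconstruction; a minimal sketch, assuming any symmetric weight function such as Eq. (1) is supplied by the caller:

```python
def induce_score(x_new, X, f, weight_fn):
    # Eq. (6): f_hat(x_new) = sum_i w(x_new, x_i) * f_i / sum_i w(x_new, x_i)
    # X holds the L-union-U points, f their learned scores.
    ws = [weight_fn(x_new, x_i) for x_i in X]
    total = sum(ws)
    return sum(w * fi for w, fi in zip(ws, f)) / total if total > 0 else 0.0
```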
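Eq. (1) is one minus a normalized L1 (city-block) distance between the two feature vectors; a minimal sketch, assuming features are scaled to [0, 1] so the weight also stays in [0, 1]:

```python
def edge_weight(x_i, x_j):
    # Eq. (1): w_ij = 1 - sum_q |x_iq - x_jq| / d
    # Identical pairs get weight 1; maximally different pairs get weight 0.
    d = len(x_i)
    return 1.0 - sum(abs(a - b) for a, b in zip(x_i, x_j)) / d
```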
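The modified weights of Eq. (2) can be sketched as follows; it assumes each vector is already expressed in the feature set of its own type (length d_cp or d_ncp), and the "cp"/"ncp" type labels are illustrative names:

```python
def modified_edge_weight(x_i, x_j, type_i, type_j):
    # Eq. (2): zero weight across copula / non-copula pairs; otherwise
    # Eq. (1) computed over the feature set of the shared type.
    if type_i != type_j:
        return 0.0
    d = len(x_i)  # d_cp or d_ncp, depending on the shared type
    return 1.0 - sum(abs(a - b) for a, b in zip(x_i, x_j)) / d
```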
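One simple way to minimize Eq. (3) is a Jacobi-style propagation derived from its stationarity conditions; this is a sketch, assuming the smoothness sum runs over unordered pairs and labels are ±1 — an illustrative solver, not the paper's actual optimization procedure:

```python
def propagate_labels(n, weights, labels, lam=1.0, iters=200):
    # Minimizes sum_{i in L} (f_i - y_i)^2 + lam * sum_{i<j} w_ij (f_i - f_j)^2
    # via the fixed-point update
    #   f_i = (y_i * 1[i in L] + lam * sum_j w_ij f_j) / (1[i in L] + lam * sum_j w_ij)
    # weights: dict mapping unordered node pairs (i, j) -> w_ij
    # labels:  dict mapping labeled node index -> +1.0 / -1.0
    nbrs = [dict() for _ in range(n)]
    for (i, j), w in weights.items():
        nbrs[i][j] = w
        nbrs[j][i] = w
    f = [0.0] * n
    for _ in range(iters):
        g = []
        for i in range(n):
            anchor = 1.0 if i in labels else 0.0
            num = anchor * labels.get(i, 0.0) + lam * sum(w * f[j] for j, w in nbrs[i].items())
            den = anchor + lam * sum(nbrs[i].values())
            g.append(num / den if den > 0 else 0.0)
        f = g
    return f
```

On a 3-node chain with endpoints labeled +1 and -1, the unlabeled middle node settles at 0 by symmetry, and the sign of f gives the predicted label.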
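The summarization idea can be illustrated with a greedy grouping sketch: merge points whose pairwise edge weight exceeds a threshold and replace each group with its component-wise mean as the representative vertex. The threshold and the mean representative are assumptions for illustration only, not the paper's actual boundary construction:

```python
def summarize_graph(X, weight_fn, threshold=0.9):
    # Greedily group data points whose edge weight to a group's first
    # member exceeds `threshold` (hypothetical parameter), then return
    # one representative vertex (the mean) per group.
    groups = []
    for x in X:
        for g in groups:
            if weight_fn(x, g[0]) >= threshold:
                g.append(x)
                break
        else:
            groups.append([x])
    d = len(X[0])
    return [[sum(v[q] for v in g) / len(g) for q in range(d)] for g in groups]
```

Near-duplicate points collapse into one vertex, so the summary graph's weight matrix is smaller and sparser, which is exactly the speed-up motivation above.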