3.3 Group $L_1$ Regularization for Feature Learning

In the context of the cQA summarization task, some features are intuitively more important than others. As a result, we group the parameters in our CRF model by their related features³ and introduce a group $L_1$-regularization term for separating the most useful features from the least important ones, where the regularization term becomes

\[
R(\theta) = C \sum_{g=1}^{G} \|\theta_g\|_2, \qquad (3)
\]

where $C$ controls the penalty magnitude of the parameters, $G$ is the number of feature groups, and $\theta_g$ denotes the parameters corresponding to the particular group $g$. Notice that this penalty term is in fact an $L_{(1,2)}$ regularization, because the parameters within each particular group are combined under the $L_2$ norm, while the resulting group weights are summed in $L_1$ form.
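To make Equation 3 concrete, here is a minimal sketch that computes the group $L_1$ penalty for a toy parameter set; the group names, values, and the helper `group_l1_penalty` are illustrative assumptions, not the paper's actual feature groups.

```python
import numpy as np

# Toy parameters: one NumPy array per feature group (names and values
# are made-up placeholders, not the paper's actual feature groups).
theta = {
    "similarity": np.array([0.7, -1.2, 0.1]),
    "novelty":    np.array([0.0, 0.0]),   # a group driven entirely to zero
    "position":   np.array([0.3]),
}

def group_l1_penalty(theta, C):
    """Equation 3: R(theta) = C * sum_g ||theta_g||_2."""
    return C * sum(np.linalg.norm(theta_g) for theta_g in theta.values())

print(group_l1_penalty(theta, C=0.5))  # 0.5 * (1.392... + 0.0 + 0.3)
```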

Given a set of training data $D = \{(x^{(i)}, y^{(i)})\}, i = 1, \ldots, N$, we estimate $\theta$ by maximizing the penalized log-likelihood
\[
L(\theta) = \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}) - C \sum_{g=1}^{G} \|\theta_g\|_2. \qquad (4)
\]
Since the regularizer is non-differentiable wherever a group norm reaches zero, we introduce an auxiliary variable $\alpha_g$ for each group and rewrite the problem as
\[
\max_{\theta, \alpha} \; \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}) - C \sum_{g=1}^{G} \alpha_g, \quad \text{subject to } \alpha_g \geq \|\theta_g\|_2, \; \forall g. \qquad (5)
\]
This formulation transforms the non-differentiable regularizer into a simple linear function, and maximizing Equation 5 leads to a solution of Equation 4 because its objective is a lower bound of the latter.
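As a quick numerical illustration of why this lower bound is tight (the norm values below are invented for the sketch): for any fixed $\theta$, the best feasible choice is $\alpha_g = \|\theta_g\|_2$, at which point the linear surrogate of Equation 5 equals the original group penalty.

```python
import numpy as np

C = 0.5
group_norms = np.array([1.3, 0.0, 0.4])   # hypothetical ||theta_g||_2 values

# The tightest alphas satisfying alpha_g >= ||theta_g||_2 are the norms
# themselves, so the linear penalty of Equation 5 coincides with the
# group L1 term of Equation 4 at the optimum.
alpha = group_norms.copy()
assert np.isclose(C * alpha.sum(), C * group_norms.sum())
```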

Then, we add a sufficiently small positive constant $\varepsilon$ when computing the $L_2$ norm (Lee et al., 2006), i.e., $\|\theta_g\|_2 = \sqrt{\sum_{j=1}^{|g|} \theta_{gj}^2 + \varepsilon}$, where $|g|$ denotes the number of features in group $g$. To obtain the optimal value of the parameter $\theta$ from the training data, we use an efficient L-BFGS solver, for which the first derivative with respect to every feature $j$ in group $g$ is
\[
\frac{\partial L}{\partial \theta_{gj}} = \sum_{i=1}^{N} \Big( C_{gj}(y^{(i)}, x^{(i)}) - \sum_{y} p(y \mid x^{(i)})\, C_{gj}(y, x^{(i)}) \Big) - \frac{2C\, \theta_{gj}}{\sqrt{\sum_{l=1}^{|g|} \theta_{gl}^2 + \varepsilon}}, \qquad (6)
\]
where $C_{gj}(y, x)$ denotes the count of feature $j$ in group $g$ for the pair $(x, y)$.
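The following is a minimal end-to-end sketch of this smoothing step, assuming a stand-in quadratic term in place of the CRF log-likelihood; the group layout, targets, and all values here are invented, and the real gradient would use the feature counts of Equation 6 rather than numerical differences.

```python
import numpy as np
from scipy.optimize import minimize

C, eps = 0.5, 1e-8
groups = [slice(0, 3), slice(3, 5)]   # two hypothetical feature groups

def smoothed_penalty(theta):
    # C * sum_g sqrt(sum_l theta_gl^2 + eps): the epsilon-smoothed norm
    # of Lee et al. (2006) that makes the penalty differentiable at zero.
    return C * sum(np.sqrt(np.sum(theta[g] ** 2) + eps) for g in groups)

def objective(theta):
    # Stand-in for the negated log-likelihood of Equation 4: a quadratic
    # pull toward synthetic "maximum-likelihood" parameters.
    target = np.array([1.0, 0.0, 0.0, -2.0, 0.0])
    return 0.5 * np.sum((theta - target) ** 2) + smoothed_penalty(theta)

# L-BFGS as in the paper; gradients are approximated numerically here
# rather than via the analytic form of Equation 6, for brevity.
result = minimize(objective, x0=np.zeros(5), method="L-BFGS-B")
print(np.round(result.x, 3))   # weakly supported groups shrink toward zero
```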