3.3 Group L1 Regularization for Feature Learning

In the context of the cQA summarization task, some features are intuitively more important than others. As a result, we group the parameters in our CRF model with their related features^3 and introduce a group L1-regularization term for selecting the most useful features from the least important ones, where the regularization term becomes

    R(\theta) = C \sum_{g=1}^{G} \lVert \vec{\theta}_g \rVert_2 ,    (3)

where C controls the penalty magnitude of the parameters, G is the number of feature groups, and \vec{\theta}_g denotes the parameters corresponding to the particular group g. Notice that this penalty term is indeed an L(1,2) regularization, because within each particular group the parameters are combined under the L2 norm, while the weights of the groups are summed in L1 form.

Given a set of training data D = \{(x^{(i)}, y^{(i)})\}, i = 1, \dots, N, we maximize the conditional log-likelihood penalized by R(\theta) (Equation 4). Introducing an auxiliary variable \alpha_g for each group turns this into the equivalent constrained problem

    \max_{\theta, \alpha} \; \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}; \theta) \;-\; C \sum_{g=1}^{G} \alpha_g ,
    subject to \alpha_g \ge \lVert \vec{\theta}_g \rVert_2 , \; \forall g.    (5)

This formulation transforms the non-differentiable regularizer into a simple linear function, and maximizing Equation 5 leads to a solution of Equation 4 because it is a lower bound of the latter. Then, we add a sufficiently small positive constant \varepsilon when computing the L2 norm (Lee et al., 2006), i.e.,

    \lVert \vec{\theta}_g \rVert_2 = \sqrt{\sum_{j=1}^{|g|} \theta_{gj}^2 + \varepsilon} ,

where |g| denotes the number of features in group g. To obtain the optimal value of the parameter \theta from the training data, we use an efficient L-BFGS solver, and the first derivative with respect to feature j in group g is

    \frac{\partial L}{\partial \theta_{gj}} = \sum_{i=1}^{N} \Big( C_{gj}(y^{(i)}, x^{(i)}) - \sum_{y} p(y \mid x^{(i)}) \, C_{gj}(y, x^{(i)}) \Big) - \frac{2C \, \theta_{gj}}{\sqrt{\sum_{l=1}^{|g|} \theta_{gl}^2 + \varepsilon}} ,    (6)

where C_{gj}
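The \varepsilon-smoothing step above can be sketched in a few lines of NumPy. This is an illustration only, not the authors' implementation: the group index arrays and the value of C are hypothetical, and the sketch differentiates the smoothed R(\theta) of Equation 3 directly, checking the analytic gradient against central finite differences.

```python
import numpy as np

# Minimal sketch (not the paper's code): the epsilon-smoothed group-L1
# regularizer of Equation 3 and its gradient. `theta` is a flat parameter
# vector; `groups` lists hypothetical index arrays, one per feature group.

def smoothed_group_l1(theta, groups, C, eps=1e-6):
    # R(theta) = C * sum_g sqrt(sum_{j in g} theta_j^2 + eps)
    return C * sum(np.sqrt(np.sum(theta[g] ** 2) + eps) for g in groups)

def smoothed_group_l1_grad(theta, groups, C, eps=1e-6):
    # dR/dtheta_j = C * theta_j / sqrt(sum_{l in g} theta_l^2 + eps) for j in g;
    # the eps constant keeps the denominator nonzero even when a group is all zeros.
    grad = np.zeros_like(theta)
    for g in groups:
        grad[g] = C * theta[g] / np.sqrt(np.sum(theta[g] ** 2) + eps)
    return grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = rng.normal(size=6)
    groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]
    C = 0.5
    g_analytic = smoothed_group_l1_grad(theta, groups, C)
    # Central finite differences as a correctness check.
    h = 1e-6
    g_fd = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = h
        g_fd[j] = (smoothed_group_l1(theta + e, groups, C)
                   - smoothed_group_l1(theta - e, groups, C)) / (2 * h)
    print(np.max(np.abs(g_analytic - g_fd)))  # should be a tiny number
```

Because the smoothed regularizer is differentiable everywhere, this gradient can simply be added to the CRF log-likelihood gradient and handed to any quasi-Newton routine such as an L-BFGS solver.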