
3.2.2. Word representation using BiLSTM networks

As illustrated in Fig. 4, our second method to produce the word representation is

similar to the first method presented in the previous section, except that we now use

BiLSTM networks to learn the character representation instead of using CNNs.

In the following, we give a brief introduction to BiLSTM networks and explain

how to apply them to character embeddings for producing the character

representation of the whole word. Note that the process of applying BiLSTM

networks to the word representations in the sentence representation stage is similar.

Besides CNNs, Recurrent Neural Networks (RNNs) [6] are one of the most popular and successful deep neural network architectures; they are specifically designed to process sequence data such as natural language. Long Short-Term Memory (LSTM) networks [8] are a variant of RNNs that deal with the long-range dependency problem by using gates at each position to control how information passes along the sequence.

Fig. 4. Word representation using BiLSTM networks

Recall that we want to learn the representation of a word represented by $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m)$, where $\mathbf{x}_i$ is the character embedding of the $i$-th character and $m$ denotes the length (in characters) of the word. At each position $i$, the LSTM network generates an output $\mathbf{y}_i$ based on a hidden state $\mathbf{h}_i$:

$$\mathbf{y}_i = \sigma(\mathbf{U}_{y}\mathbf{h}_i + \mathbf{b}_{y}),$$

where the hidden state $\mathbf{h}_i$ is updated by several gates, including an input gate $\mathbf{I}_i$, a forget gate $\mathbf{F}_i$, an output gate $\mathbf{O}_i$, and a memory cell $\mathbf{C}_i$, as follows:

$$\mathbf{I}_i = \sigma(\mathbf{U}_{\mathrm{I}}\mathbf{x}_i + \mathbf{V}_{\mathrm{I}}\mathbf{h}_{i-1} + \mathbf{b}_{\mathrm{I}}),$$

$$\mathbf{F}_i = \sigma(\mathbf{U}_{\mathrm{F}}\mathbf{x}_i + \mathbf{V}_{\mathrm{F}}\mathbf{h}_{i-1} + \mathbf{b}_{\mathrm{F}}),$$

$$\mathbf{O}_i = \sigma(\mathbf{U}_{\mathrm{O}}\mathbf{x}_i + \mathbf{V}_{\mathrm{O}}\mathbf{h}_{i-1} + \mathbf{b}_{\mathrm{O}}),$$

$$\mathbf{C}_i = \mathbf{F}_i \odot \mathbf{C}_{i-1} + \mathbf{I}_i \odot \tanh(\mathbf{U}_{\mathrm{C}}\mathbf{x}_i + \mathbf{V}_{\mathrm{C}}\mathbf{h}_{i-1} + \mathbf{b}_{\mathrm{C}}),$$

$$\mathbf{h}_i = \mathbf{O}_i \odot \tanh(\mathbf{C}_i).$$

In the above equations, $\sigma$ and $\odot$ denote the element-wise sigmoid function and the element-wise multiplication operator, respectively; the $\mathbf{U}$ and $\mathbf{V}$ are weight matrices and the $\mathbf{b}$ are bias vectors, all of which are learned during the training process.
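As a minimal sketch of these update rules (assuming NumPy; the parameter names U_I, V_I, b_I, etc. and the dictionary layout are hypothetical choices for this illustration, not part of the original model description), one LSTM step could be implemented as:

import numpy as np

def sigmoid(z):
    # element-wise sigmoid used by the input, forget, and output gates
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_i, h_prev, c_prev, p):
    # One LSTM update following the equations above.
    # p is a dict holding the weight matrices U_*, V_* and the bias vectors b_*.
    i_gate = sigmoid(p["U_I"] @ x_i + p["V_I"] @ h_prev + p["b_I"])  # input gate I_i
    f_gate = sigmoid(p["U_F"] @ x_i + p["V_F"] @ h_prev + p["b_F"])  # forget gate F_i
    o_gate = sigmoid(p["U_O"] @ x_i + p["V_O"] @ h_prev + p["b_O"])  # output gate O_i
    c_i = f_gate * c_prev + i_gate * np.tanh(p["U_C"] @ x_i + p["V_C"] @ h_prev + p["b_C"])  # memory cell C_i
    h_i = o_gate * np.tanh(c_i)  # hidden state h_i
    return h_i, c_i

Here * is the element-wise product ($\odot$) and @ denotes matrix-vector multiplication.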

LSTM networks model sequence data in one direction, usually from left to right. To capture information from both directions, our model employs Bidirectional LSTM (BiLSTM) networks [7]. The main idea of BiLSTM networks is to integrate two LSTM networks: one moves from left to right (the forward LSTM) and the other moves in the opposite direction, i.e., from right to left (the backward LSTM). Specifically, the hidden state $\mathbf{h}_i$ of the BiLSTM is the concatenation of the hidden states of the two LSTMs.
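A minimal sketch of this bidirectional pass is given below; it reuses the hypothetical lstm_step from the previous sketch, assumes separate forward and backward parameter sets (again, the names are illustrative only), and stops at the per-position BiLSTM hidden states:

def bilstm_hidden_states(char_embs, fwd_params, bwd_params, hidden_dim):
    # char_embs: list of character embedding vectors x_1 ... x_m of one word
    m = len(char_embs)
    h_f = np.zeros(hidden_dim); c_f = np.zeros(hidden_dim)
    h_b = np.zeros(hidden_dim); c_b = np.zeros(hidden_dim)
    forward_states, backward_states = [], []
    for i in range(m):                        # forward LSTM: left to right
        h_f, c_f = lstm_step(char_embs[i], h_f, c_f, fwd_params)
        forward_states.append(h_f)
    for i in reversed(range(m)):              # backward LSTM: right to left
        h_b, c_b = lstm_step(char_embs[i], h_b, c_b, bwd_params)
        backward_states.append(h_b)
    backward_states.reverse()                 # align backward states with positions 1..m
    # BiLSTM hidden state at position i: concatenation of the two LSTMs' hidden states
    return [np.concatenate([f, b]) for f, b in zip(forward_states, backward_states)]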