
3.2.2. Word representation using BiLSTM networks

As illustrated in Fig. 4, our second method to produce the word representation is

similar to the first method presented in the previous section, except that we now use

BiLSTM networks to learn the character representation instead of using CNNs.

In the following, we give a brief introduction to BiLSTM networks and explain

how to apply them to character embeddings for producing the character

representation of the whole word. Note that the process of applying BiLSTM

networks to the word representations in the sentence representation stage is similar.

Besides CNNs, Recurrent Neural Networks (RNNs) [6] are one of the most popular and successful deep neural network architectures; they are specifically designed to process sequence data such as natural language. Long Short-Term Memory (LSTM) networks [8] are a variant of RNNs that deal with the long-range dependency problem by using gates at each position to control how information passes along the sequence.

Fig. 4. Word representation using BiLSTM networks

Recall that we want to learn the representation of a word represented by $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m)$, where $\mathbf{x}_i$ is the character embedding of the $i$-th character and $m$ denotes the length (in characters) of the word. At each position $i$, the LSTM network generates an output $\mathbf{y}_i$ based on a hidden state $\mathbf{h}_i$:

$$\mathbf{y}_i = \sigma(\mathbf{U}_{y}\mathbf{h}_i + \mathbf{b}_{y}),$$

where the hidden state $\mathbf{h}_i$ is updated by several gates, including an input gate $\mathbf{I}_i$, a forget gate $\mathbf{F}_i$, an output gate $\mathbf{O}_i$, and a memory cell $\mathbf{C}_i$, as follows:

$$\mathbf{I}_i = \sigma(\mathbf{U}_{\mathrm{I}}\mathbf{x}_i + \mathbf{V}_{\mathrm{I}}\mathbf{h}_{i-1} + \mathbf{b}_{\mathrm{I}}),$$

$$\mathbf{F}_i = \sigma(\mathbf{U}_{\mathrm{F}}\mathbf{x}_i + \mathbf{V}_{\mathrm{F}}\mathbf{h}_{i-1} + \mathbf{b}_{\mathrm{F}}),$$

$$\mathbf{O}_i = \sigma(\mathbf{U}_{\mathrm{O}}\mathbf{x}_i + \mathbf{V}_{\mathrm{O}}\mathbf{h}_{i-1} + \mathbf{b}_{\mathrm{O}}),$$

$$\mathbf{C}_i = \mathbf{F}_i \odot \mathbf{C}_{i-1} + \mathbf{I}_i \odot \tanh(\mathbf{U}_{\mathrm{C}}\mathbf{x}_i + \mathbf{V}_{\mathrm{C}}\mathbf{h}_{i-1} + \mathbf{b}_{\mathrm{C}}),$$

$$\mathbf{h}_i = \mathbf{O}_i \odot \tanh(\mathbf{C}_i).$$

In the above equations, $\sigma$ and $\odot$ denote the element-wise sigmoid function and the element-wise multiplication operator, respectively; the $\mathbf{U}$ and $\mathbf{V}$ are weight matrices and the $\mathbf{b}$ are bias vectors, all of which are learned during the training process.
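As a minimal sketch of these update rules (assuming NumPy; the parameter names U_I, V_I, b_I, etc. and the dictionary layout are hypothetical choices for this illustration, not part of the original model description), one LSTM step could be implemented as:

import numpy as np

def sigmoid(z):
    # element-wise sigmoid used by the input, forget, and output gates
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_i, h_prev, c_prev, p):
    # One LSTM update following the equations above.
    # p is a dict holding the weight matrices U_*, V_* and the bias vectors b_*.
    i_gate = sigmoid(p["U_I"] @ x_i + p["V_I"] @ h_prev + p["b_I"])  # input gate I_i
    f_gate = sigmoid(p["U_F"] @ x_i + p["V_F"] @ h_prev + p["b_F"])  # forget gate F_i
    o_gate = sigmoid(p["U_O"] @ x_i + p["V_O"] @ h_prev + p["b_O"])  # output gate O_i
    c_i = f_gate * c_prev + i_gate * np.tanh(p["U_C"] @ x_i + p["V_C"] @ h_prev + p["b_C"])  # memory cell C_i
    h_i = o_gate * np.tanh(c_i)  # hidden state h_i
    return h_i, c_i

Here * is the element-wise product ($\odot$) and @ denotes matrix-vector multiplication.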

LSTM networks model sequence data in one direction, usually from left to right. To capture information from both directions, our model employs Bidirectional LSTM (BiLSTM) networks [7]. The main idea of BiLSTM networks is to integrate two LSTM networks: one moves from left to right (the forward LSTM) and the other moves in the opposite direction, i.e., from right to left (the backward LSTM). Specifically, the hidden state $\mathbf{h}_i$ of the BiLSTM is the concatenation of the hidden states of the two LSTMs.
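A minimal sketch of this bidirectional pass is given below; it reuses the hypothetical lstm_step from the previous sketch, assumes separate forward and backward parameter sets (again, the names are illustrative only), and stops at the per-position BiLSTM hidden states:

def bilstm_hidden_states(char_embs, fwd_params, bwd_params, hidden_dim):
    # char_embs: list of character embedding vectors x_1 ... x_m of one word
    m = len(char_embs)
    h_f = np.zeros(hidden_dim); c_f = np.zeros(hidden_dim)
    h_b = np.zeros(hidden_dim); c_b = np.zeros(hidden_dim)
    forward_states, backward_states = [], []
    for i in range(m):                        # forward LSTM: left to right
        h_f, c_f = lstm_step(char_embs[i], h_f, c_f, fwd_params)
        forward_states.append(h_f)
    for i in reversed(range(m)):              # backward LSTM: right to left
        h_b, c_b = lstm_step(char_embs[i], h_b, c_b, bwd_params)
        backward_states.append(h_b)
    backward_states.reverse()                 # align backward states with positions 1..m
    # BiLSTM hidden state at position i: concatenation of the two LSTMs' hidden states
    return [np.concatenate([f, b]) for f, b in zip(forward_states, backward_states)]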