INTRODUCTION QUESTION ANSWERING (QA), A SUBFIELD OF INFORMATI...


1. Introduction

Question Answering (QA), a subfield of Information Retrieval (IR) and Natural Language Processing (NLP), aims to build computer systems that can automatically answer users' questions posed in natural language. These systems are applied in a growing number of fields, such as e-commerce, business, and education. Nowadays, students carry mobile phones or laptops everywhere, which keeps them connected with the world. Universities therefore have good reason to develop their own QA systems to foster student engagement anytime and anywhere. This benefits both students and universities. Students can easily get information about a university or college, such as degrees, programs, courses, lecturers, campuses, admission conditions, and scholarships. Universities benefit in three ways: recruiting new students, by making it easier for them to find information about the institution; ensuring constant communication, by providing instant, around-the-clock feedback to many users at once, especially during admission periods; and creating a universally accessible website for the university.

There are two main approaches to building a QA system: 1) the Information Retrieval (IR) based approach, and 2) the knowledge-based approach. An IR-based QA system consists of three steps. First, the question is processed to extract important information (question analysis step). Next, the processed question serves as the input for information retrieval on the World Wide Web (WWW) or on a collection of documents, and answer candidates are extracted from the returned documents (answer extraction step). Finally, the answer is selected from among the candidates (answer selection step). While an IR-based QA method finds the answer in the WWW or in a collection of (plain) documents, a knowledge-based QA method computes the answer from existing knowledge bases in two steps. The first step, question analysis, is similar to the one in an IR-based system. In the second step, a query or formal representation is formed from the extracted information and is then used to query the knowledge bases and retrieve the answer.

Question analysis, the task of extracting important information from the question, is a key step in both IR-based and knowledge-based question answering. The extracted information is used to extract answer candidates and select the final answer in an IR-based QA system, or to form the query or formal representation in a knowledge-based QA system. Without it, the system cannot "understand" the question and therefore fails to find the correct answer. Many studies have been conducted on question analysis. Most of them fall into one of two categories: 1) question classification or intent detection [9, 12, 17, 18] and 2) Named Entity Recognition (NER) in questions [2, 20]. While question classification determines the type of the question or the type of the expected answer, NER aims to extract the important information expressed by named entities in the question.

In this work, we deal with the task of Vietnamese question analysis in the education domain. Given a Vietnamese question, our goal is to extract the named entities it contains, such as university names, campus names, department names, major names, lecturer names, numbers, school years, time, and duration. Table 1 shows examples of questions, the named entities in those questions, and their translations in English. The outputs of the task can be exploited to develop an online QA system, delivered as a web-based or mobile application.
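To make the target output of question analysis concrete, the sketch below encodes the first example question from Table 1 as token-level tags and recovers its bracketed entities. The BIO tagging scheme, the label names MAJOR and TIME, the syllable-level tokenization, and the helper bio_to_entities are illustrative assumptions; this section does not specify how the entities are encoded.

```python
from typing import List, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, text) pairs from a BIO-tagged token sequence.

    "B-X" opens an entity of type X, "I-X" continues it, "O" is outside
    any entity. An "I-X" that does not continue an open X entity is ignored.
    """
    entities: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity being built (if any) and reset the buffer.
        nonlocal current_label, current_tokens
        if current_label is not None:
            entities.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            flush()
    flush()
    return entities


# Question 1 of Table 1, split into syllables (assumed tokenization).
tokens = ["Học", "phí", "ngành", "kế", "toán", "năm", "nay", "bao", "nhiêu", "ạ", "?"]
# Hypothetical label names: MAJOR for a major/program name, TIME for time.
tags = ["O", "O", "B-MAJOR", "I-MAJOR", "I-MAJOR", "B-TIME", "I-TIME", "O", "O", "O", "O"]

print(bio_to_entities(tokens, tags))
# [('MAJOR', 'ngành kế toán'), ('TIME', 'năm nay')]
```

A downstream QA component could then use these (label, text) pairs either to filter answer candidates or to fill the slots of a knowledge-base query.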
We investigate several methods for the task, including traditional probabilistic graphical models such as Conditional Random Fields (CRFs) and more advanced deep neural networks built from Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks. Although CRFs can be used to train an accurate recognition model from a fairly small annotated dataset, they require a manually designed feature set. Recent deep neural networks have been shown to be powerful models that achieve very high performance with features learned automatically from raw data. Neural networks, however, are data hungry: they need to be trained on a fairly large dataset, which is hard to obtain for a task in a specific domain. To overcome these challenges, we introduce recognition models that integrate multiple neural network layers for learning word and sentence representations with a CRF layer for inference. By utilizing both automatically learned and manually engineered features, our models outperform competitive baselines, including a CRF model and neural network models that use only automatically learned features.

Table 1. Examples of Vietnamese questions and named entities in the education domain

No | Question (with English translation) | Entities
1 | Học phí [ngành kế toán][năm nay] bao nhiêu ạ? / How much is the tuition fee of the [Accounting Program][this year]? | ngành kế toán (Accounting Program): a major/program name; năm nay (this year): time
2 | [Sinh viên năm nhất] học ở [Ngụy Như] hay [Thanh Xuân] ạ? / Do [freshmen] study at [Nguy Nhu] or [Thanh Xuan]? | Sinh viên năm nhất (freshmen): the academic year of students (first year); Ngụy Như (Nguy Nhu): a campus name; Thanh Xuân (Thanh Xuan): a campus
3 | Cho em hỏi số điện thoại của [cô Ngân] ở [phòng đào tạo] ạ? / Could you please tell me the phone number of [Ms. Ngan] from the [Training Department]? | cô Ngân (Ms. Ngan): the name of a staff member; phòng đào tạo (Training Department): a department name
4 | Điều kiện để nhận [học bổng Yamada] là gì ạ? / What are the conditions for [Yamada Scholarship]? | học bổng Yamada (Yamada scholarship): the name of a scholarship program

Our contributions can be summarized as follows: 1) we present several models for recognizing named entities in Vietnamese questions, which combine traditional statistical methods and advanced deep neural networks with a rich feature set; 2) we introduce an annotated corpus for the task, consisting of 3,600 Vietnamese questions collected from the online forum of the VNU International School, which will be made available at publication time; and 3) we empirically verify the effectiveness of the proposed models by conducting a series of experiments and analyses on that corpus. Compared to previous studies [2, 5, 15, 21, 24, 25], we focus on the education domain and exploit advanced machine learning techniques, i.e., deep neural networks.
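The role of the CRF layer used for inference can be illustrated with Viterbi decoding, the standard way to find the best tag sequence in a linear-chain CRF. The sketch below is a minimal pure-Python version under stated assumptions: the three-tag set, the emission scores (standing in for the outputs of the neural layers), and the transition scores are all hypothetical values chosen for illustration, not parameters of the proposed models.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at position t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    # best[j] = score of the best path ending in tag j at the current position
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(n_tags):
            # Previous tag that maximises the path score into tag j.
            prev = max(range(n_tags), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Start from the best final tag and follow the back-pointers.
    path = [max(range(n_tags), key=lambda j: best[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


TAGS = ["O", "B-ENT", "I-ENT"]  # hypothetical tag set
# Hypothetical per-token scores for a three-token question.
emissions = [
    [2.0, 1.5, 0.0],
    [0.5, 0.4, 2.0],
    [1.0, 0.0, 0.2],
]
# Hypothetical transition scores; O -> I-ENT is made practically impossible.
transitions = [
    [0.0, 0.0, -1e4],  # from O
    [0.0, 0.0, 0.0],   # from B-ENT
    [0.0, 0.0, 0.0],   # from I-ENT
]

print([TAGS[i] for i in viterbi_decode(emissions, transitions)])
# ['B-ENT', 'I-ENT', 'O']
```

Choosing the best tag independently at each position would give O, I-ENT, O here, an invalid sequence in which an entity starts with I-ENT. Decoding over the whole sequence with transition scores avoids this, which is one common motivation for placing a CRF layer on top of neural encoders.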

