
1. Introduction

Question Answering (QA), a subfield of Information Retrieval (IR) and Natural

Language Processing (NLP), aims to build computer systems that can

automatically answer users' questions posed in natural language. These systems are

increasingly applied in fields such as e-commerce, business, and education.

Nowadays, students everywhere carry a mobile phone or laptop, which keeps

them connected to the world. Universities therefore increasingly need to develop

their own QA systems to foster student engagement anytime and anywhere. This

brings multiple benefits to both students and universities. Students can easily

get information about a university/college such as degrees, programs, courses,

lecturers, campuses, admission conditions, and scholarships. For universities, a QA system helps

in recruiting new students by making it easier for them to find a

college/university’s information; in ensuring constant communication by providing

instant, around-the-clock feedback to many users at once, especially during admission periods; and

in creating a universally accessible website for the university.

There are two main approaches to building a QA system: 1) Information Retrieval

(IR) based approach, and 2) knowledge-based approach. An IR-based QA system

consists of three steps. First, the question is processed to extract important

information (question analysis step). Next, the processed question serves as the input

for information retrieval on the World Wide Web (WWW) or on a collection of

documents. Answer candidates are then extracted from the returned documents

(answer extraction step). The final answer is selected among the candidates (answer

selection step). While an IR-based QA method finds the answer from the WWW or

a collection of (plain) documents, a knowledge-based QA method computes the

answer using existing knowledge bases in two steps. The first step, question analysis,

is similar to the one in an IR-based system. In the next step, a query or formal

representation is formed from the extracted information and is then used to

query existing knowledge bases to retrieve the answer.
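
The IR-based pipeline described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the system studied in this work: the function names, the stopword list, the keyword-overlap scoring, and the two-document collection are all hypothetical and serve only to make the question analysis, retrieval, answer extraction, and answer selection steps concrete.

```python
import re

# Hypothetical stopword list, used only for this illustration.
STOPWORDS = {"what", "is", "the", "of", "a", "an", "how", "much", "for", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def analyze_question(question):
    """Question analysis: keep the content words of the question."""
    return {tok for tok in tokenize(question) if tok not in STOPWORDS}


def retrieve(keywords, documents, top_k=2):
    """Information retrieval: rank documents by keyword overlap."""
    scored = [(len(keywords & set(tokenize(doc))), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def extract_candidates(keywords, retrieved):
    """Answer extraction: each sentence that mentions a keyword is a candidate."""
    candidates = []
    for doc in retrieved:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            if keywords & set(tokenize(sentence)):
                candidates.append(sentence)
    return candidates


def select_answer(keywords, candidates):
    """Answer selection: return the candidate with the largest keyword overlap."""
    if not candidates:
        return None
    return max(candidates, key=lambda s: len(keywords & set(tokenize(s))))


def answer(question, documents):
    """Run the four steps in order and return the selected answer (or None)."""
    keywords = analyze_question(question)
    retrieved = retrieve(keywords, documents)
    return select_answer(keywords, extract_candidates(keywords, retrieved))


# Made-up document collection for the example.
DOCUMENTS = [
    "The Accounting Program lasts four years. "
    "The tuition fee of the Accounting Program is listed on the admissions page.",
    "The Training Department is open on weekdays.",
]

print(answer("What is the tuition fee of the Accounting Program?", DOCUMENTS))
```

A knowledge-based system would keep the same question analysis step but replace retrieval and extraction with a query over a knowledge base.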

Question analysis, the task of extracting important information from the

question, is a key step in both IR-based and knowledge-based question answering.

Such information will be exploited to extract answer candidates and select the final

answer in an IR-based QA system or to form the query or formal representation in a

knowledge-based QA system. Without the information extracted in the question analysis

step, the system cannot “understand” the question and therefore fails to find the

correct answer. Many studies have been conducted on question analysis. Most of

them fall into one of two categories: 1) question classification or intent detection

[9, 12, 17, 18] and 2) Named Entity Recognition (NER) in questions [2, 20]. While

question classification determines the type of question or the type of the expected

answer, the task of NER aims to extract important information expressed by named

entities in the questions.

In this work, we deal with the task of Vietnamese question analysis in the

education domain. Given a Vietnamese question, our goal is to extract named entities

in the question, such as university names, campus names, department names, major

names, lecturer names, numbers, school years, time, and duration. Table 1 shows

examples of questions, named entities in those questions, and their translations in

English. The outputs of the task can be used to develop an online QA system, either

web-based or as a mobile app. We investigate several methods to deal with the task,

including traditional probabilistic graphical models like Conditional Random Fields

(CRFs) and more advanced deep neural networks with Convolutional Neural

Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks.

Although CRFs can train an accurate recognition model from a fairly small

annotated dataset, they require a manually designed feature set. Recent deep

neural networks have proven to be powerful models that achieve very

high performance with features learned automatically from raw data. Neural

networks, however, are data hungry: they must be trained on a fairly large dataset,

which is hard to obtain for a task in a specific domain. To overcome these challenges,

we introduce recognition models that integrate multiple neural network layers for

learning word and sentence representations, and a CRF layer for inference. By

utilizing both automatically learned and manually engineered features, our models

outperform competitive baselines, including a CRF model and neural network models

that use only automatically learned features.
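
To make the role of a CRF layer at inference time concrete, the following minimal Python sketch shows Viterbi decoding for a linear-chain CRF. It is an illustration under stated assumptions rather than the models of this work: the function name `viterbi_decode`, the tag names, and all scores are hypothetical. In a neural model the emission scores would come from the network layers, and the transition scores would be learned jointly with them.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence of a linear-chain CRF.

    emissions[t][j]   : score of tag j at position t (e.g. produced by a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(tags)
    scores = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at position t.
            best_prev = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emissions[t][j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag.
    best_last = max(range(n_tags), key=lambda j: scores[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [tags[i] for i in path]


# Hypothetical tag set and made-up scores for a three-token input.
TAGS = ["O", "B-MAJOR", "I-MAJOR"]
EMISSIONS = [
    [2.0, 0.5, 0.0],
    [0.2, 1.0, 1.1],  # taken alone, this position prefers I-MAJOR
    [0.1, 0.3, 2.0],
]
TRANSITIONS = [
    [0.0, 0.0, -10.0],  # O -> I-MAJOR is heavily penalized
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

print(viterbi_decode(EMISSIONS, TRANSITIONS, TAGS))
```

Choosing the best tag at each position independently would give O, I-MAJOR, I-MAJOR, an ill-formed sequence. Decoding over the whole sequence with transition scores yields O, B-MAJOR, I-MAJOR instead, which is the kind of consistency a CRF layer adds on top of per-token scores.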

Table 1. Examples of Vietnamese questions and named entities in the education domain

No. 1
Question: Học phí [ngành kế toán][năm nay] bao nhiêu ạ?
Translation: How much is the tuition fee of the [Accounting Program][this year]?
Entities:
– ngành kế toán (Accounting Program): a major/program name
– năm nay (this year): time

No. 2
Question: [Sinh viên năm nhất] học ở [Ngụy Như] hay [Thanh Xuân] ạ?
Translation: Do [freshmen] study at [Nguy Nhu] or [Thanh Xuan]?
Entities:
– Sinh viên năm nhất (freshmen): the academic year of students (first year)
– Ngụy Như (Nguy Nhu): a campus name
– Thanh Xuân (Thanh Xuan): a campus

No. 3
Question: Cho em hỏi số điện thoại của [cô Ngân] ở [phòng đào tạo] ạ?
Translation: Could you please tell me the phone number of [Ms. Ngan] from the [Training Department]?
Entities:
– cô Ngân (Ms. Ngan): the name of a staff
– phòng đào tạo (Training Department): a department name

No. 4
Question: Điều kiện để nhận [học bổng Yamada] là gì ạ?
Translation: What are the conditions for [Yamada Scholarship]?
Entities:
– học bổng Yamada (Yamada scholarship): the name of a scholarship program
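
Entity annotations like those in Table 1 are commonly turned into one label per token before a sequence model such as a CRF is trained. The sketch below shows one such encoding, the BIO scheme, applied to question 1. The BIO scheme, the label names `MAJOR` and `TIME`, the whitespace (syllable-level) tokenization, and the helper `spans_to_bio` are assumptions made for illustration; this section does not state which tagging scheme or tokenization is used.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to one BIO tag per token.

    spans: list of (start, end, label) in token indices, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Question 1 from Table 1, split on whitespace (an assumed tokenization).
TOKENS = ["Học", "phí", "ngành", "kế", "toán", "năm", "nay", "bao", "nhiêu", "ạ", "?"]
# Hypothetical label names for the two entities marked in Table 1.
SPANS = [(2, 5, "MAJOR"), (5, 7, "TIME")]

for token, tag in zip(TOKENS, spans_to_bio(TOKENS, SPANS)):
    print(f"{token}\t{tag}")
```

Under this encoding, recognizing the entities in a question amounts to predicting the tag column from the token column.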

Our contributions can be summarized as follows: 1) we present

several models for recognizing named entities in Vietnamese questions, which

combine traditional statistical methods and advanced deep neural networks with a

rich feature set; 2) we introduce an annotated corpus for the task, consisting of 3,600

Vietnamese questions collected from the online forum of the VNU International

School (the dataset will be made available at publication time); and 3) we empirically

verify the effectiveness of the proposed models by conducting a series of experiments

and analyses on that corpus. Compared to previous studies [2, 5, 15, 21, 24, 25], we

focus on the education domain and exploit advanced machine learning techniques,

i.e., deep neural networks.