1. Introduction
Question Answering (QA), a subfield of Information Retrieval (IR) and Natural
Language Processing (NLP), aims to build computer systems that can
automatically answer users’ questions posed in natural language. These systems are
increasingly applied in fields such as e-commerce, business, and education.
Students today carry mobile phones or laptops almost everywhere, which keeps them
connected to the world. Universities therefore increasingly need their own QA systems to
foster students’ engagement anytime and anywhere. Such systems benefit both
students and universities. Students can easily get information about a
university or college, such as degrees, programs, courses, lecturers, campuses,
admission conditions, and scholarships. Universities benefit in recruiting new
students, because prospective students can find information about the institution
more easily; in ensuring constant communication, by providing instant,
around-the-clock feedback to many users at once, especially during admission periods;
and in creating a universally accessible website for the university.
There are two main approaches to building a QA system: 1) the Information Retrieval
(IR)-based approach and 2) the knowledge-based approach. An IR-based QA system
consists of three steps. First, the question is processed to extract important
information (question analysis step). Next, the processed question serves as the input
for information retrieval on the World Wide Web (WWW) or on a collection of
documents. Answer candidates are then extracted from the returned documents
(answer extraction step). The final answer is selected among the candidates (answer
selection step). While an IR-based QA method finds the answer from the WWW or
a collection of (plain) documents, a knowledge-based QA method computes the
answer using existing knowledge bases in two steps. The first step, question analysis,
is similar to the one in an IR-based system. In the next step, a query or formal
representation is formed from the extracted information and is then used to
query existing knowledge bases to retrieve the answer.
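The IR-based pipeline described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the system studied in this paper: the function names, the stopword list, the keyword-overlap scoring, and the two sample documents are all invented for the example.

```python
import re

# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"the", "a", "an", "is", "of", "what", "how", "much", "in", "for", "to"}


def words(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def analyze_question(question):
    """Question analysis: keep the content words of the question as keywords."""
    return {w for w in words(question) if w not in STOPWORDS}


def retrieve(keywords, documents, top_k=2):
    """Information retrieval: rank documents by keyword overlap."""
    return sorted(documents, key=lambda d: len(keywords & set(words(d))), reverse=True)[:top_k]


def extract_candidates(documents):
    """Answer extraction: treat every sentence of a returned document as a candidate."""
    candidates = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            if sentence.strip():
                candidates.append(sentence.strip())
    return candidates


def select_answer(keywords, candidates):
    """Answer selection: pick the candidate that shares the most keywords."""
    return max(candidates, key=lambda c: len(keywords & set(words(c))))


def answer(question, documents):
    keywords = analyze_question(question)
    returned = retrieve(keywords, documents)
    return select_answer(keywords, extract_candidates(returned))


# Invented toy documents, not real university data.
DOCUMENTS = [
    "The Accounting Program lasts four years. "
    "The tuition fee of the Accounting Program is listed on the admissions page.",
    "Freshmen study at the main campus. The library opens in the morning.",
]

result = answer("What is the tuition fee of the Accounting Program?", DOCUMENTS)
print(result)
```

A knowledge-based system would share `analyze_question` but replace the last three functions with the construction and execution of a query over a knowledge base.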
Question analysis, the task of extracting important information from the
question, is a key step in both IR-based and knowledge-based question answering.
Such information will be exploited to extract answer candidates and select the final
answer in an IR-based QA system or to form the query or formal representation in a
knowledge-based QA system. Without the information extracted in the question analysis
step, the system cannot “understand” the question and therefore fails to find the
correct answer. Many studies have been conducted on question analysis. Most of
them fall into one of two categories: 1) question classification or intent detection
[9, 12, 17, 18] and 2) Named Entity Recognition (NER) in questions [2, 20]. While
question classification determines the type of question or the type of the expected
answer, the task of NER aims to extract important information expressed by named
entities in the questions.
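NER is commonly cast as sequence labeling, where each token receives a tag that marks whether it begins, continues, or lies outside an entity. The sketch below illustrates this with the widely used BIO scheme. It is an assumption made for illustration only: this section does not state which tagging scheme is used, the label names `MAJOR` and `TIME` are hypothetical, and the question is split on whitespace into syllables rather than segmented into Vietnamese words.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans (start, end, label), end exclusive, into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_entities(tokens, tags):
    """Recover (entity text, label) pairs from a BIO-tagged token sequence."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities


# "How much is the tuition fee of the [Accounting Program][this year]?"
tokens = "Học phí ngành kế toán năm nay bao nhiêu ạ ?".split()
spans = [(2, 5, "MAJOR"), (5, 7, "TIME")]  # hypothetical label names

tags = spans_to_bio(tokens, spans)
entities = bio_to_entities(tokens, tags)
print(list(zip(tokens, tags)))
print(entities)
```

Under this framing, a recognition model only has to predict one tag per token, and the entities are read off the predicted tag sequence.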
In this work, we deal with the task of Vietnamese question analysis in the
education domain. Given a Vietnamese question, our goal is to extract named entities
in the question, such as university names, campus names, department names, major
names, lecturer names, numbers, school years, time, and duration. Table 1 shows
examples of questions, named entities in those questions, and their translations in
English. The outputs of the task can be exploited to develop an online QA system,
delivered as a web-based or mobile application. We investigate several methods to deal with the task,
including traditional probabilistic graphical models like Conditional Random Fields
(CRFs) and more advanced deep neural networks with Convolutional Neural
Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks.
Although CRFs can be used to train an accurate recognition model from a fairly small
annotated dataset, they require a manually designed feature set. Recent deep
neural networks have been shown to be powerful models that can achieve very
high performance with features learned automatically from raw data. Neural
networks, however, are data hungry: they need to be trained on a fairly large dataset,
which is hard to obtain for a task in a specific domain. To overcome these challenges,
we introduce recognition models that integrate multiple neural network layers for
learning word and sentence representations, and a CRF layer for inference. By
utilizing both automatically learned and manually engineered features, our models
outperform competitive baselines, including a CRF model and neural network models
that use only automatically learned features.
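To make the role of a CRF layer at inference time concrete, the sketch below implements Viterbi decoding, the standard algorithm for finding the best tag sequence in a linear-chain CRF. It is a minimal illustration, not the proposed architecture: the tag set, the emission scores (which in a neural model would come from the network layers), and the transition scores are all invented, and start and end transitions are omitted for brevity.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at position t (e.g. from a neural encoder)
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at position t.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["O", "B", "I"]  # hypothetical tag set

# Invented transition scores; O -> I is heavily penalised because an
# entity cannot continue without having begun.
transitions = [
    [0.0, 0.0, -100.0],  # from O
    [0.0, 0.0, 1.0],     # from B
    [0.0, 0.0, 0.5],     # from I
]

# Invented emission scores for a three-token input.
emissions = [
    [2.0, 0.1, 0.0],
    [0.5, 0.6, 0.9],
    [0.1, 0.2, 1.5],
]

greedy = [TAGS[max(range(3), key=lambda j: row[j])] for row in emissions]
decoded = [TAGS[i] for i in viterbi(emissions, transitions)]
print(greedy)
print(decoded)
```

Choosing the best tag for each token independently yields the invalid sequence O, I, I here, whereas decoding over the whole sequence yields O, B, I. This is the kind of sequence-level consistency that a CRF layer adds on top of per-token scores.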
Table 1. Examples of Vietnamese questions and named entities in the education domain

No. 1
  Question: Học phí [ngành kế toán][năm nay] bao nhiêu ạ?
  Translation: How much is the tuition fee of the [Accounting Program][this year]?
  Entities:
    - ngành kế toán (Accounting Program): a major/program name
    - năm nay (this year): time

No. 2
  Question: [Sinh viên năm nhất] học ở [Ngụy Như] hay [Thanh Xuân] ạ?
  Translation: Do [freshmen] study at [Nguy Nhu] or [Thanh Xuan]?
  Entities:
    - Sinh viên năm nhất (freshmen): the academic year of students (first year)
    - Ngụy Như (Nguy Nhu): a campus name
    - Thanh Xuân (Thanh Xuan): a campus

No. 3
  Question: Cho em hỏi số điện thoại của [cô Ngân] ở [phòng đào tạo] ạ?
  Translation: Could you please tell me the phone number of [Ms. Ngan] from the [Training Department]?
  Entities:
    - cô Ngân (Ms. Ngan): the name of a staff member
    - phòng đào tạo (Training Department): a department name

No. 4
  Question: Điều kiện để nhận [học bổng Yamada] là gì ạ?
  Translation: What are the conditions for [Yamada Scholarship]?
  Entities:
    - học bổng Yamada (Yamada scholarship): the name of a scholarship program

Our contributions can be summarized in the following points: 1) we present
several models for recognizing named entities in Vietnamese questions, which
combine traditional statistical methods and advanced deep neural networks with a
rich feature set; 2) we introduce an annotated corpus for the task, consisting of 3,600
Vietnamese questions collected from the online forum of the VNU International
School (the dataset will be made available at publication time); and 3) we empirically
verify the effectiveness of the proposed models by conducting a series of experiments
and analyses on that corpus. Compared to previous studies [2, 5, 15, 21, 24, 25], we
focus on the education domain and exploit advanced machine learning techniques,
i.e., deep neural networks.