3. Recognition models

Given a Vietnamese input question represented as a sequence of words $s = w_1 w_2 \ldots w_n$, where $n$ denotes the length (in words) of $s$, our goal is to extract all

the named entities in the question. A named entity is a word or a sequence of

consecutive words that provides information about campuses, lecturers, subjects,

departments, and so on. Such important information clarifies the question and needs

to be extracted to answer the question.

Our task belongs to information extraction, a subfield of natural language

processing which aims to extract important information from text. We cast our task

as a sequence tagging problem, which assigns a tag to each word in the input sentence

to indicate whether the word begins a named entity (tag B), is inside (not at the

beginning) a named entity (tag I), or is outside all named entities (tag O). Table 2

shows two examples of tagged sentences in the IOB notation. For example, the tag

B-MajorName indicates that the word begins a major name, while the tag

I-ScholarName indicates that the word is inside (not at the beginning) a scholarship

name.

Table 2. Examples of tagged sentences using the IOB notation

Học_phí/O ngành/B-MajorName kế_toán/I-MajorName năm/B-Datetime nay/I-Datetime bao_nhiêu/O ạ/O ?/O
(How much is the tuition fee of the Accounting Program this year?)

Điều_kiện/O để/O nhận/O học_bổng/B-ScholarName Yamada/I-ScholarName là/O gì/O ạ/O ?/O
(What are the conditions for Yamada Scholarship?)
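To make the IOB notation concrete, the following minimal Python sketch shows one way to recover entity spans from a tagged sentence written in the word/TAG format of Table 2. The function names parse_tagged and extract_entities are hypothetical and are not part of our system. The sketch also assumes that tokens are separated by whitespace and that an I- tag with no matching preceding B- tag is treated as outside any entity.

```python
from typing import List, Tuple


def parse_tagged(sentence: str) -> List[Tuple[str, str]]:
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # Split on the LAST slash so a word containing '/' stays intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def extract_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged words into (entity_type, entity_text) spans."""
    entities = []
    words, etype = [], None
    for word, tag in pairs:
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close any open one first.
            if words:
                entities.append((etype, " ".join(words)))
            words, etype = [word], tag[2:]
        elif tag.startswith("I-") and words and tag[2:] == etype:
            # An I- tag continues the open entity of the same type.
            words.append(word)
        else:
            # An O tag (or an ill-formed I- tag, by assumption) closes the entity.
            if words:
                entities.append((etype, " ".join(words)))
            words, etype = [], None
    if words:
        entities.append((etype, " ".join(words)))
    return entities


example = ("Học_phí/O ngành/B-MajorName kế_toán/I-MajorName "
           "năm/B-Datetime nay/I-Datetime bao_nhiêu/O ạ/O ?/O")
print(extract_entities(parse_tagged(example)))
# [('MajorName', 'ngành kế_toán'), ('Datetime', 'năm nay')]
```

Applied to the first sentence of Table 2, the sketch returns one MajorName span and one Datetime span. This illustrates how a sequence of per-word tags determines the named entities of the question.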

In the following, we present our models for solving the above sequence tagging

task, including a CRF-based model and more advanced models with deep neural

networks. The CRF-based model exploits a traditional but powerful sequence

learning method (i.e., conditional random fields) with manually designed features,

and can serve as a strong baseline for comparison with our neural models.