3. Recognition models
Given a Vietnamese input question represented as a sequence of words
$s = w_1 w_2 \ldots w_n$, where $n$ denotes the length (in words) of $s$, our goal is to extract all
the named entities in the question. A named entity is a word or a sequence of
consecutive words that provides information about campuses, lecturers, subjects,
departments, and so on. Such information clarifies the question and needs
to be extracted in order to answer it.
Our task belongs to information extraction, a subfield of natural language
processing which aims to extract important information from text. We cast our task
as a sequence tagging problem, which assigns a tag to each word in the input sentence
to indicate whether the word begins a named entity (tag B), is inside a named
entity but not at its beginning (tag I), or lies outside all named entities (tag O).
Table 2 shows two examples of tagged sentences in the IOB notation. For example, the tag
B-MajorName indicates that the word begins a major name, while the tag
I-ScholarName indicates that the word is inside a scholarship name but not at
its beginning.
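To make the tagging scheme concrete, the following minimal Python sketch groups IOB-tagged words back into named entities, using the first example sentence of Table 2 as input. The function names `parse_tagged` and `iob_to_entities` are illustrative assumptions and not part of the system described here. Dropping an I tag that does not continue an entity of the same type is likewise just one possible design choice.

```python
from typing import List, Tuple


def parse_tagged(sentence: str) -> List[Tuple[str, str]]:
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # Split on the last '/' so that words containing '/' stay intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def iob_to_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group IOB-tagged words into (entity_type, entity_text) tuples."""
    entities = []
    words, etype = [], None  # words and type of the entity being built

    def flush():
        # Emit the entity collected so far, if any.
        if words:
            entities.append((etype, " ".join(words)))

    for word, tag in pairs:
        if tag.startswith("B-"):
            # A B tag always closes the previous entity and opens a new one.
            flush()
            words, etype = [word], tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            # An I tag of the same type continues the current entity.
            words.append(word)
        else:
            # An O tag (or an I tag that continues nothing) ends the entity.
            flush()
            words, etype = [], None
    flush()
    return entities


example = (
    "Học_phí/O ngành/B-MajorName kế_toán/I-MajorName "
    "năm/B-Datetime nay/I-Datetime bao_nhiêu/O ạ/O ?/O"
)
print(iob_to_entities(parse_tagged(example)))
# [('MajorName', 'ngành kế_toán'), ('Datetime', 'năm nay')]
```

The two adjacent entities are separated correctly because the B tag on "năm" marks the start of a new entity, which is the reason the scheme distinguishes B from I.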
Table 2. Examples of tagged sentences using the IOB notation

Học_phí/O ngành/B-MajorName kế_toán/I-MajorName năm/B-Datetime nay/I-Datetime bao_nhiêu/O ạ/O ?/O
(How much is the tuition fee of the Accounting Program this year?)

Điều_kiện/O để/O nhận/O học_bổng/B-ScholarName Yamada/I-ScholarName là/O gì/O ạ/O ?/O
(What are the conditions for Yamada Scholarship?)

In the following we present our models for solving the above sequence tagging
task, including a CRF-based model and more advanced models with deep neural
networks. The CRF-based model uses conditional random fields, a traditional but
powerful sequence learning method, with manually designed features; it serves
as a strong baseline for comparison with our neural models.