tem because each QA system uses its own question-type system. It is very typical in the course of system development to redesign the question-type system in order to improve system performance. This inevitably leads to revision of a large-scale training dataset, which requires a heavy workload.

For example, assume that you have to develop a Chinese or Greek QA system and have 10,000 question-answer pairs. You have to manually classify the questions according to your own question-type system. In addition, you have to annotate large-scale Chinese or Greek documents with question-type tags. If you wanted to split the question type ORGANIZATION into three categories, COMPANY, SCHOOL, and OTHER ORGANIZATION, then the ORGANIZATION tags in the annotated document set would need to be manually revisited and revised.

To solve this problem, this paper regards Ques-

2.1 Training Data

Document Set: Japanese newspaper articles of The Mainichi Newspaper published in 1995.

Question/Answer Set: We used the CRL¹ QA Data (Sekine et al., 2002). This dataset comprises 2,000 Japanese questions with correct answers as well as question types and IDs of articles that contain the answers. Each question is categorized as one of 115 hierarchically classified question types.

The document set is used not only in the training phase but also in the execution phase.

Although the CRL QA Data contains question types, the question-type information is not used for training. This is because more than 60% of question types have fewer than 10 questions as examples (Table 1). This means it is very unlikely that we can train a QA system that can handle this 60% due to data sparseness.
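
The sparseness figure above can be checked with a simple count over the question-type labels. The following is a minimal sketch under stated assumptions: the function name `sparse_type_fraction`, the threshold parameter `min_examples`, and the toy labels are all hypothetical and are not taken from the CRL QA Data or its distribution format.

```python
from collections import Counter


def sparse_type_fraction(question_types, min_examples=10):
    """Return the fraction of distinct question types that have
    fewer than `min_examples` questions."""
    counts = Counter(question_types)
    if not counts:
        return 0.0
    sparse = sum(1 for n in counts.values() if n < min_examples)
    return sparse / len(counts)


# Toy labels invented for illustration (not real CRL QA Data).
toy_labels = (
    ["PERSON"] * 12
    + ["LOCATION"] * 10
    + ["SCHOOL"] * 3
    + ["COMPANY"] * 2
    + ["DATE"] * 1
)

# Three of the five toy types fall below the threshold of 10 examples.
fraction = sparse_type_fraction(toy_labels)
```

Applied to the question-type column of a real question set, the same count would show how many types are too sparsely populated to learn from.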