… implemented an article error detection method (Nagata et al., 2006) in the blog system as a trial attempt to keep the learners motivated, since learners are likely to become tired of doing the same exercise repeatedly. To reduce this, the blog system highlights where article errors exist after the essay has been submitted. The hope is that this might prompt the learners to write more accurately and to continue the exercises. In the pre-experiments, the detection did indeed seem to interest the learners and to provide them with additional motivation. Considering these results, we decided to include the fourth and fifth steps in the writing exercises when we created our learner corpus. At the same time, we should of course be aware that the use of error detection affects learners' writing. For example, it may change the …

3.2.1 Error Annotation

We based our error annotation scheme on that used in the NICT JLE corpus (Izumi et al., 2003a), whose detailed description is readily available, for example, in Izumi et al. (2005). In that annotation scheme, and accordingly in ours, errors are tagged using an XML syntax; an error is annotated by tagging the word or phrase that contains it. For instance, a tense error is annotated as follows:

I <v_tns crr="made">make</v_tns> pies last year.

where v_tns denotes a tense error in a verb. It should be emphasized that the error tags contain the information on correction together with the error annotation. For instance, crr="made" in the above example denotes that the correct form of the verb is made.
For missing word errors, error tags are placed where a word or phrase is missing (e.g., My friends live <prp crr="in"></prp> these places.).
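To make the format concrete, the following is a minimal sketch of how such error tags could be read off a single sentence. It is an illustration only, not part of our annotation toolchain: the function name and the dummy root element are our own, and it assumes that tag names are written with underscores (e.g., v_tns) and that each annotated sentence is well-formed XML once wrapped in a root element.

    import xml.etree.ElementTree as ET

    def extract_errors(sentence):
        """Return (error_type, learner_text, correction) triples for one
        error-annotated sentence; hypothetical helper for illustration."""
        root = ET.fromstring("<s>" + sentence + "</s>")  # dummy root element
        triples = []
        for elem in root.iter():
            if elem is root:
                continue  # skip the wrapper itself
            triples.append((elem.tag,          # error type, e.g. 'v_tns'
                            elem.text or "",   # empty for missing-word errors
                            elem.get("crr")))  # the correction attribute
        return triples

    print(extract_errors('I <v_tns crr="made">make</v_tns> pies last year.'))
    # [('v_tns', 'make', 'made')]
    print(extract_errors('My friends live <prp crr="in"></prp> these places.'))
    # [('prp', '', 'in')]

Note that the empty element in the second sentence yields an empty learner string, which is exactly how a missing-word error is signalled in the scheme.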
As a pilot study, we applied the NICT JLE annotation scheme to a learner corpus to reveal what modifications we needed to make. The learner corpus consisted of 455 essays (39,716 words) written by junior high and high school students³. The following describes the major modifications deemed necessary as a result of the pilot study.
The biggest difference between the NICT JLE corpus and our targeted corpus is that the former is spoken data and the latter is written data. This difference inevitably requires several modifications to the annotation scheme. In speech data, there are no errors in spelling and mechanics such as punctuation and capitalization. However, since such errors are not usually regarded as grammatical errors, we decided simply not to annotate them in our annotation schemes.
Another major difference is fragment errors. Fragments that do not form a complete sentence often appear in the writing of learners (e.g., I have many books. Because I like reading.). In written language, fragments can be regarded as a grammatical error. To annotate fragment errors, we added a new tag f (e.g., I have many books. <f>Because I like reading.</f>).

As discussed in Sect. 2, there is a trade-off between the granularity of an annotation scheme and the level of difficulty in annotating errors. In our annotation scheme, we narrowed down the number of tags to 22 from 46 in the original NICT JLE tag set to facilitate the annotation; the 22 tags are shown in Appendix A. The removed tags are merged into the tag for other. For instance, there are only three tags for errors in nouns (number, lexis, and other) in our tag set whereas there are six in the NICT JLE corpus (inflection, number, case, countability, complement, and lexis); the other tag (n_o) …

3.2.2 POS/Chunking Annotation

We used the Penn Treebank scheme as the basis of our POS/chunking annotation scheme. Similar to the error annotation scheme, we conducted a pilot study to determine what modifications we needed to make to the Penn Treebank scheme. In the pilot study, we used the same learner corpus as in the pilot study for the error annotation scheme.

As a result of the pilot study, we found that the Penn Treebank tag set sufficed in most cases except for errors which learners made. Considering this, we determined a basic rule as follows: "Use the Penn Treebank tag set and preserve the original texts as much as possible." To handle such errors, we made several modifications and added two new POS tags (CE and UK) and another two for chunking (XP and PH), which are described below.

A major modification concerns errors in mechanics such as Tonight,we and beautifulhouse, as already explained in Sect. 2. We use the symbol "-" to annotate such cases. For instance, the above two examples are annotated as follows: Tonight,we/NN-,-PRP and beautifulhouse/JJ-NN. Note that each POS tag is hyphenated. It can also be used for annotating chunks in the same manner. For instance, Tonight,we is annotated as [NP-PH-NP Tonight,we/NN-,-PRP ]. Here, the tag PH is a chunk label and denotes tokens which are not normally chunked (cf., [NP Tonight/NN ] ,/, [NP we/PRP ]).
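As an illustration of this notation (ours, not a tool shipped with the corpus), a token carrying hyphenated POS tags can be split back into its surface string and tag sequence, assuming that '/' and '-' never occur inside an individual Penn Treebank tag name:

    def split_annotation(token):
        """Split e.g. 'Tonight,we/NN-,-PRP' into the surface string
        and its list of POS tags; hypothetical helper for illustration."""
        word, tags = token.rsplit("/", 1)  # rightmost '/' separates word from tags
        return word, tags.split("-")

    print(split_annotation("Tonight,we/NN-,-PRP"))   # ('Tonight,we', ['NN', ',', 'PRP'])
    print(split_annotation("beautifulhouse/JJ-NN"))  # ('beautifulhouse', ['JJ', 'NN'])

The same split applies to hyphenated chunk labels such as NP-PH-NP.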
Another major modification was required to handle grammatical errors. Essentially, POS/chunking tags are assigned according to the surface information of the word in question, regardless of the existence of any errors. For example, There is apples. is annotated as [NP There/EX ] [VP is/VBZ ] [NP apples/NNS ] ./. Additionally, we define the CE⁴ tag to annotate errors in which learners use a word with a POS which is not allowed, such as in I don't success cooking. The CE tag encodes a POS …
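For completeness, a small hypothetical reader for the bracketed chunk notation used above is sketched below; the regular expression and function are our own illustration, assuming chunk labels consist of word characters and hyphens and that unbracketed tokens (such as the final ./.) are left unchunked:

    import re

    # One bracketed chunk '[LABEL tok/TAG ... ]', or a bare unchunked token.
    CHUNK = re.compile(r"\[([\w-]+)\s+(.*?)\s*\]|(\S+)")

    def parse_chunks(line):
        """Parse a shallow-parsed line into a list of
        (chunk_label_or_None, [(word, tag), ...]) pairs."""
        parsed = []
        for m in CHUNK.finditer(line):
            label, body, loose = m.groups()
            tokens = (body or loose).split()
            pairs = [tuple(t.rsplit("/", 1)) for t in tokens]
            parsed.append((label, pairs))
        return parsed

    print(parse_chunks("[NP There/EX ] [VP is/VBZ ] [NP apples/NNS ] ./."))
    # [('NP', [('There', 'EX')]), ('VP', [('is', 'VBZ')]),
    #  ('NP', [('apples', 'NNS')]), (None, [('.', '.')])]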