
We implemented an article error detection method (Nagata et al., 2006) in the blog system as a trial attempt to keep the learners motivated, since learners are likely to become tired of doing the same exercise repeatedly. To reduce this, the blog system highlights where article errors exist after the essay has been submitted. The hope is that this might prompt the learners to write more accurately and to continue the exercises. In the pre-experiments, the detection did indeed seem to interest the learners and to provide them with additional motivation. Considering these results, we decided to include the fourth and fifth steps in the writing exercises when we created our learner corpus. At the same time, we should of course be aware that the use of error detection affects learners' writing. For example, it may change the distribution of errors in the resulting corpus.

3.2.1 Error Annotation

We based our error annotation scheme on that used in the NICT JLE corpus (Izumi et al., 2003a), whose detailed description is readily available, for example, in Izumi et al. (2005). In that annotation scheme, and accordingly in ours, errors are tagged using an XML syntax; an error is annotated by tagging the word or phrase that contains it. For instance, a tense error is annotated as follows:

I <v_tns crr="made">make</v_tns> pies last year.

where v_tns denotes a tense error in a verb. It should be emphasized that the error tags contain the information on correction together with the error annotation. For instance, crr="made" in the above example denotes that the correct form of the verb is made. For missing word errors, error tags are placed where a word or phrase is missing (e.g., My friends live <prp crr="in"></prp> these places.).
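To make the format concrete, the following short Python sketch (ours, not part of the corpus tools) reads sentences annotated in this style; the tag names and the crr attribute are taken from the examples above, while the <s> wrapper element is an assumption added so that the fragments form well-formed XML.

    import xml.etree.ElementTree as ET

    # Minimal sketch: reading sentences annotated in the style described above.
    # The tag names (v_tns, prp) and the crr attribute follow the examples in
    # the text; the <s> wrapper element is an assumption added for illustration.
    sentences = [
        '<s>I <v_tns crr="made">make</v_tns> pies last year.</s>',
        '<s>My friends live <prp crr="in"></prp> these places.</s>',
    ]

    for xml_sentence in sentences:
        root = ET.fromstring(xml_sentence)
        for err in root:                  # each child element is one error tag
            error_type = err.tag          # e.g. 'v_tns' (tense error in a verb)
            original = err.text or ""     # learner's words; empty for missing-word errors
            correction = err.get("crr")   # the correction stored with the annotation
            print(error_type, repr(original), "->", correction)

Note that replacement errors and missing-word errors come out of the same loop: for the latter, the element simply has no text and the crr attribute alone supplies the missing word.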

As a pilot study, we applied the NICT JLE annotation scheme to a learner corpus to reveal what modifications we needed to make. The learner corpus consisted of 455 essays (39,716 words) written by junior high and high school students³. The following describes the major modifications deemed necessary as a result of the pilot study.

The biggest difference between the NICT JLE corpus and our targeted corpus is that the former is spoken data and the latter is written data. This difference inevitably requires several modifications to the annotation scheme. In speech data, there are no errors in spelling and mechanics such as punctuation and capitalization. However, since such errors are not usually regarded as grammatical errors, we decided simply not to annotate them in our annotation schemes.

Another major difference is fragment errors. Fragments that do not form a complete sentence often appear in the writing of learners (e.g., I have many books. Because I like reading.). In written language, fragments can be regarded as a grammatical error. To annotate fragment errors, we added a new tag <f> (e.g., I have many books. <f>Because I like reading.</f>).

As discussed in Sect. 2, there is a trade-off between the granularity of an annotation scheme and the level of difficulty in annotating errors. In our annotation scheme, we narrowed down the number of tags from 46 in the original NICT JLE tag set to 22 to facilitate the annotation; the 22 tags are shown in Appendix A. The removed tags are merged into the tag for other. For instance, there are only three tags for errors in nouns (number, lexis, and other) in our tag set whereas there are six in the NICT JLE corpus (inflection, number, case, countability, complement, and lexis); the other tag (<n_o>) covers the four removed tags.
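To illustrate how such merging can be expressed, the sketch below maps a finer-grained set of noun error tags onto the reduced one; apart from n_o, which is quoted above, the NICT JLE-style tag names used here are hypothetical placeholders rather than the corpus's actual labels.

    # Illustrative sketch of merging a finer-grained noun error tag set into
    # the reduced set described above. Except for n_o, the tag names are
    # hypothetical placeholders, not the actual NICT JLE labels.
    NOUN_TAG_MAP = {
        "n_num": "n_num",  # number errors are kept
        "n_lxc": "n_lxc",  # lexis errors are kept
        "n_inf": "n_o",    # inflection   -> other
        "n_cs":  "n_o",    # case         -> other
        "n_cnt": "n_o",    # countability -> other
        "n_cmp": "n_o",    # complement   -> other
    }

    def reduce_tag(tag: str) -> str:
        """Map a fine-grained noun error tag to the reduced tag set."""
        return NOUN_TAG_MAP.get(tag, tag)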

3.2.2 POS/Chunking Annotation

We based our POS/chunking annotation scheme on the Penn Treebank annotation scheme. Similar to the error annotation scheme, we conducted a pilot study to determine what modifications we needed to make to the Penn Treebank scheme. In the pilot study, we used the same learner corpus as in the pilot study for the error annotation scheme.

As a result of the pilot study, we found that the Penn Treebank tag set sufficed in most cases except for errors which learners made. Considering this, we determined a basic rule as follows: "Use the Penn Treebank tag set and preserve the original texts as much as possible." To handle such errors, we made several modifications and added two new POS tags (CE and UK) and another two for chunking (XP and PH), which are described below.

A major modification concerns errors in mechanics such as Tonight,we and beautifulhouse, as already explained in Sect. 2. We use the symbol "-" to annotate such cases. For instance, the above two examples are annotated as follows: Tonight,we/NN-,-PRP and beautifulhouse/JJ-NN. Note that each POS tag is hyphenated. The same symbol can also be used for annotating chunks in the same manner. For instance, Tonight,we is annotated as [NP-PH-NP Tonight,we/NN-,-PRP ]. Here, the tag PH is a chunk label that denotes tokens which are not normally chunked (cf., [NP Tonight/NN ] ,/, [NP we/PRP ]).

gether with the POS which would have been as-

signed to the word if it were not for the error. For
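As a rough illustration, the information carried by a CE annotation can be modelled as a pair of POS tags; the class below and the concrete POS values are our own assumptions for illustration, not the tag's actual notation.

    from dataclasses import dataclass

    # Rough illustration of what a CE annotation encodes, as described above:
    # the POS suggested by the surface form together with the POS the word
    # would have been assigned if it were not for the error. The class and the
    # example values are our own sketch, not the actual CE notation.
    @dataclass
    class CEAnnotation:
        token: str
        surface_pos: str   # POS obtained from the surface information
        intended_pos: str  # POS the word would have had without the error

    # In "I don't success cooking.", 'success' is a noun on the surface (NN)
    # but is used where a verb (VB) would be required.
    example = CEAnnotation(token="success", surface_pos="NN", intended_pos="VB")
    print(example)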