
We implemented an article error detection method (Nagata et al., 2006) in the blog system as a trial attempt to keep the learners motivated, since learners are likely to become tired of doing the same exercise repeatedly. To reduce this, the blog system highlights where article errors exist after the essay has been submitted. The hope is that this might prompt the learners to write more accurately and to continue the exercises. In the pre-experiments, the detection did indeed seem to interest the learners and to provide them with additional motivation. Considering these results, we decided to include the fourth and fifth steps in the writing exercises when we created our learner corpus. At the same time, we should of course be aware that the use of error detection affects learners' writing. For example, it may change the distribution of errors in the resulting corpus.

3.2.1 Error Annotation

We based our error annotation scheme on that used in the NICT JLE corpus (Izumi et al., 2003a), whose detailed description is readily available, for example, in Izumi et al. (2005). In that annotation scheme, and accordingly in ours, errors are tagged using an XML syntax; an error is annotated by tagging the word or phrase that contains it. For instance, a tense error is annotated as follows:

I <v_tns crr="made">make</v_tns> pies last year.

where v_tns denotes a tense error in a verb. It should be emphasized that the error tags contain the information on correction together with the error annotation. For instance, crr="made" in the above example denotes that the correct form of the verb is made. For missing word errors, error tags are placed where a word or phrase is missing (e.g., My friends live <prp crr="in"></prp> these places.).
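To make the format concrete, the following short Python sketch (ours, not part of the corpus tools) reads sentences annotated in this style; the tag names and the crr attribute are taken from the examples above, while the <s> wrapper element is an assumption added so that the fragments form well-formed XML.

    import xml.etree.ElementTree as ET

    # Minimal sketch: reading sentences annotated in the style described above.
    # The tag names (v_tns, prp) and the crr attribute follow the examples in
    # the text; the <s> wrapper element is an assumption added for illustration.
    sentences = [
        '<s>I <v_tns crr="made">make</v_tns> pies last year.</s>',
        '<s>My friends live <prp crr="in"></prp> these places.</s>',
    ]

    for xml_sentence in sentences:
        root = ET.fromstring(xml_sentence)
        for err in root:                  # each child element is one error tag
            error_type = err.tag          # e.g. 'v_tns' (tense error in a verb)
            original = err.text or ""     # learner's words; empty for missing-word errors
            correction = err.get("crr")   # the correction stored with the annotation
            print(error_type, repr(original), "->", correction)

Note that replacement errors and missing-word errors come out of the same loop: for the latter, the element simply has no text and the crr attribute alone supplies the missing word.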

As a pilot study, we applied the NICT JLE annotation scheme to a learner corpus to reveal what modifications we needed to make. The learner corpus consisted of 455 essays (39,716 words) written by junior high and high school students³. The following describes the major modifications deemed necessary as a result of the pilot study.

The biggest difference between the NICT JLE corpus and our targeted corpus is that the former is spoken data and the latter is written data. This difference inevitably requires several modifications to the annotation scheme. In speech data, there are no errors in spelling and mechanics such as punctuation and capitalization. However, since such errors are not usually regarded as grammatical errors, we decided simply not to annotate them in our annotation schemes.

Another major difference is fragment errors. Fragments that do not form a complete sentence often appear in the writing of learners (e.g., I have many books. Because I like reading.). In written language, fragments can be regarded as a grammatical error. To annotate fragment errors, we added a new tag <f> (e.g., I have many books. <f>Because I like reading.</f>).

As discussed in Sect. 2, there is a trade-off between the granularity of an annotation scheme and the level of difficulty in annotating errors. In our annotation scheme, we narrowed down the number of tags from 46 in the original NICT JLE tag set to 22 to facilitate the annotation; the 22 tags are shown in Appendix A. The removed tags are merged into the tag for other. For instance, there are only three tags for errors in nouns (number, lexis, and other) in our tag set whereas there are six in the NICT JLE corpus (inflection, number, case, countability, complement, and lexis); the other tag (<n_o>) covers the four removed tags.
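To illustrate how such merging can be expressed, the sketch below maps a finer-grained set of noun error tags onto the reduced one; apart from n_o, which is quoted above, the NICT JLE-style tag names used here are hypothetical placeholders rather than the corpus's actual labels.

    # Illustrative sketch of merging a finer-grained noun error tag set into
    # the reduced set described above. Except for n_o, the tag names are
    # hypothetical placeholders, not the actual NICT JLE labels.
    NOUN_TAG_MAP = {
        "n_num": "n_num",  # number errors are kept
        "n_lxc": "n_lxc",  # lexis errors are kept
        "n_inf": "n_o",    # inflection   -> other
        "n_cs":  "n_o",    # case         -> other
        "n_cnt": "n_o",    # countability -> other
        "n_cmp": "n_o",    # complement   -> other
    }

    def reduce_tag(tag: str) -> str:
        """Map a fine-grained noun error tag to the reduced tag set."""
        return NOUN_TAG_MAP.get(tag, tag)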

3.2.2 POS/Chunking Annotation

We based our POS/chunking annotation scheme on the Penn Treebank annotation scheme. Similar to the error annotation scheme, we conducted a pilot study to determine what modifications we needed to make to the Penn Treebank scheme. In the pilot study, we used the same learner corpus as in the pilot study for the error annotation scheme.

As a result of the pilot study, we found that the Penn Treebank tag set sufficed in most cases except for errors which learners made. Considering this, we determined a basic rule as follows: "Use the Penn Treebank tag set and preserve the original texts as much as possible." To handle such errors, we made several modifications and added two new POS tags (CE and UK) and another two for chunking (XP and PH), which are described below.

A major modification concerns errors in mechanics such as Tonight,we and beautifulhouse, as already explained in Sect. 2. We use the symbol "-" to annotate such cases. For instance, the above two examples are annotated as follows: Tonight,we/NN-,-PRP and beautifulhouse/JJ-NN. Note that each POS tag is hyphenated. The same symbol can also be used for annotating chunks in the same manner. For instance, Tonight,we is annotated as [NP-PH-NP Tonight,we/NN-,-PRP ]. Here, the tag PH is a chunk label that denotes tokens which are not normally chunked (cf., [NP Tonight/NN ] ,/, [NP we/PRP ]).

gether with the POS which would have been as-

signed to the word if it were not for the error. For
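As a rough illustration, the information carried by a CE annotation can be modelled as a pair of POS tags; the class below and the concrete POS values are our own assumptions for illustration, not the tag's actual notation.

    from dataclasses import dataclass

    # Rough illustration of what a CE annotation encodes, as described above:
    # the POS suggested by the surface form together with the POS the word
    # would have been assigned if it were not for the error. The class and the
    # example values are our own sketch, not the actual CE notation.
    @dataclass
    class CEAnnotation:
        token: str
        surface_pos: str   # POS obtained from the surface information
        intended_pos: str  # POS the word would have had without the error

    # In "I don't success cooking.", 'success' is a noun on the surface (NN)
    # but is used where a verb (VB) would be required.
    example = CEAnnotation(token="success", surface_pos="NN", intended_pos="VB")
    print(example)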