3-2-4 Kita-Aoyama, Tokyo, 107-0061 Japan

{whittaker,sheinman}@jiem.co.jp

rnagata@konan-u.ac.jp

Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 1210–1219,

Abstract

The availability of learner corpora, especially those which have been manually error-tagged or shallow-parsed, is still limited. This means that researchers do not have a common development and test set for natural language processing of learner English, such as grammatical error detection. Given this background, we created a novel learner corpus that was manually error-tagged and shallow-parsed. This corpus is available for research and educational purposes on the web. In this paper, we describe it in detail together with its data-collection method and annotation schemes. Another contribution of this paper is that we take the first step toward evaluating the performance of existing POS-tagging/chunking techniques on learner corpora using the created corpus. These contributions will facilitate further research in related areas such as grammatical error detection and automated essay scoring.

1 Introduction

The availability of learner corpora is still somewhat limited despite the obvious usefulness of such data in conducting research on natural language processing of learner English in recent years. In particular, learner corpora tagged with grammatical errors are rare because of the difficulties inherent in learner corpus creation, as will be described in Sect. 2. As shown in Table 1, very few existing learner corpora are error-tagged (see Leacock et al. (2010) for a more detailed discussion of learner corpora). Even if data is error-tagged, it is often not available to the public or its access is severely restricted. For example, the Cambridge Learner Corpus, which is one of the largest error-tagged learner corpora, can only be used by authors and writers working for Cambridge University Press and by members of staff at Cambridge ESOL.

Name                                       Error-tagged  Parsed  Size (words)  Availability
Cambridge Learner Corpus                   Yes           No      30 million    No
CLEC Corpus                                Yes           No      1 million     Partially
ETLC Corpus                                Partially     No      2 million     Not Known
HKUST Corpus                               Yes           No      30 million    No
ICLE Corpus (Granger et al., 2009)         No            No      3.7 million+  Yes
JEFLL Corpus (Tono, 2000)                  No            No      1 million     Partially
Longman Learners’ Corpus                   No            No      10 million    Not Known
NICT JLE Corpus (Izumi et al., 2003a)      Partially     No      2 million     Partially
Polish Learner English Corpus              No            No      0.5 million   No
Janus Pannonius University Learner Corpus  No            No      0.4 million   Not Known

In Availability, Yes denotes that the full text of the corpus is available to the public. Partially denotes that it is accessible through specially-made interfaces such as a concordancer. The information in this table may not be consistent because many of the URLs of the corpora give only sparse information about them.

Table 1: Learner corpus list.

Error-tagged learner corpora are crucial for developing and evaluating error detection/correction algorithms such as those described in (Rozovskaya and Roth, 2010b; Chodorow and Leacock, 2000; Chodorow et al., 2007; Felice and Pulman, 2008; Han et al., 2004; Han et al., 2006; Izumi et al., 2003b; Lee and Seneff, 2008; Nagata et al., 2004; Nagata et al., 2005; Nagata et al., 2006; Tetreault et al., 2010b). This is one of the most active research areas in natural language processing of learner English. Because of the restrictions on the availability of such corpora, researchers have used their own learner corpora, which are often not available to other researchers, to develop and evaluate error detection/correction methods. This means that the detection/correction performance of each existing method is not directly comparable, as Rozovskaya and Roth (2010a) and Tetreault et al. (2010a) point out. In other words, we are not sure which methods achieve the best performance. Commonly available error-tagged learner corpora are therefore essential to further research in this area.

For similar reasons, to the best of our knowledge, there exists no learner corpus that is manually shallow-parsed and also publicly available, unlike, say, native-speaker corpora such as the Penn Treebank. Such a comparison brings up another crucial question: “Do existing POS taggers and chunkers work on learner English as well as on edited text such as newspaper articles?” Nobody really knows the answer to the question. The only exception in the literature is the work by Tetreault et al. (2010b), who evaluated parsing performance in relation to prepositions. Nevertheless, a great number of researchers have used existing POS taggers and chunkers to analyze the writing of learners of English. For instance, error detection methods normally use a POS tagger and/or a chunker in the error detection process. It is therefore possible that a major cause of false positives and negatives in error detection may be attributed to errors in POS-tagging and chunking. In corpus linguistics, researchers (Aarts and Granger,

and educational purposes on the web¹. Another contribution of this paper is that we take the first step toward answering the question about the performance of existing POS-tagging/chunking techniques on learner data. We report and discuss the results in Sect. 5.

2 Difficulties in Learner Corpus Creation

In addition to the common difficulties in creating any corpus, learner corpus creation has its own difficulties. We classify them into the following four categories of the difficulty in: