3-2-4 Kita-Aoyama, Tokyo, 107-0061 Japan
{whittaker,sheinman}@jiem.co.jp
rnagata @ konan-u.ac.jp.
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 1210–1219

Abstract

The availability of learner corpora, especially those which have been manually error-tagged or shallow-parsed, is still limited. This means that researchers do not have a common development and test set for natural language processing of learner English such as for grammatical error detection. Given this background, we created a novel learner corpus that was manually error-tagged and shallow-parsed. This corpus is available for research and educational purposes on the web. In this paper, we describe it in detail together with its data-collection method and annotation schemes. Another contribution of this paper is that we take the first step toward evaluating the performance of existing POS-tagging/chunking techniques on learner corpora using the created corpus. These contributions will facilitate further research in related areas such as grammatical error detection and automated essay scoring.

1 Introduction

The availability of learner corpora is still somewhat limited despite the obvious usefulness of such data in conducting research on natural language processing of learner English in recent years. In particular, learner corpora tagged with grammatical errors are rare because of the difficulties inherent in learner corpus creation as will be described in Sect. 2. As shown in Table 1, error-tagged learner corpora are very few among existing learner corpora (see Leacock et al. (2010) for a more detailed discussion of learner corpora). Even if data is error-tagged, it is often not available to the public or its access is severely restricted. For example, the Cambridge Learner Corpus, which is one of the largest error-tagged learner corpora, can only be used by authors and writers working for Cambridge University Press and by members of staff at Cambridge ESOL.

Name                                       Error-tagged  Parsed  Size (words)  Availability
Cambridge Learner Corpus                   Yes           No      30 million    No
CLEC Corpus                                Yes           No      1 million     Partially
ETLC Corpus                                Partially     No      2 million     Not Known
HKUST Corpus                               Yes           No      30 million    No
ICLE Corpus (Granger et al., 2009)         No            No      3.7 million+  Yes
JEFLL Corpus (Tono, 2000)                  No            No      1 million     Partially
Longman Learners’ Corpus                   No            No      10 million    Not Known
NICT JLE Corpus (Izumi et al., 2003a)      Partially     No      2 million     Partially
Polish Learner English Corpus              No            No      0.5 million   No
Janus Pannonius University Learner Corpus  No            No      0.4 million   Not Known

In Availability, Yes denotes that the full text of the corpus is available to the public. Partially denotes that it is accessible through specially-made interfaces such as a concordancer. The information in this table may not be consistent because many of the URLs of the corpora give only sparse information about them.

Table 1: Learner corpus list.

Error-tagged learner corpora are crucial for developing and evaluating error detection/correction algorithms such as those described in (Rozovskaya and Roth, 2010b; Chodorow and Leacock, 2000; Chodorow et al., 2007; Felice and Pulman, 2008; Han et al., 2004; Han et al., 2006; Izumi et al., 2003b; Lee and Seneff, 2008; Nagata et al., 2004; Nagata et al., 2005; Nagata et al., 2006; Tetreault et al., 2010b). This is one of the most active research areas in natural language processing of learner English. Because of the restrictions on their availability, researchers have used their own learner corpora to develop and evaluate error detection/correction methods, which are often not commonly available to other researchers. This means that the detection/correction performance of each existing method is not directly comparable as Rozovskaya and Roth (2010a) and Tetreault et al. (2010a) point out. In other words, we are not sure which methods achieve the best performance. Commonly available error-tagged learner corpora are therefore essential to further research in this area.

For similar reasons, to the best of our knowledge, there exists no such learner corpus that is manually shallow-parsed and which is also publicly available, unlike, say, native-speaker corpora such as the Penn Treebank. Such a comparison brings up another crucial question: “Do existing POS taggers and chunkers work on learner English as well as on edited text