
researchers (Cabezas and Resnik 2005, Carpuat and Wu 2007) provide abundant evidence that rich context features are useful in MT tasks. Carpuat and Wu (2007) tried to integrate a Phrase Sense Disambiguation (PSD) model into their Chinese-English SMT system, and they found that the POS tag preceding a given phrase, the POS tag following the phrase, and bag-of-words are the three most useful features. Following their approach, we use the word preceding and the word following a verb as the context features.

The Static and Dynamic Verb Phrase Tables provide us with MT examples to translate a VTG. The system first references the Dynamic Verb Phrase Table, as it is more likely to yield a good translation. If the record is not found, the Static one is referenced. If it is not found in either, the given VTG will not be processed. No matter which table is referenced, the following Naive Bayes equation is applied to obtain the translation of a given VTG.

\[
\begin{aligned}
t' &= \arg\max_{t_k} P(t_k \mid pw, fw) \\
   &= \arg\max_{t_k} \bigl( \log P(t_k) + \log P(pw \mid t_k) + \log P(fw \mid t_k) \bigr)
\end{aligned}
\]

$pw$, $fw$ and $t_k$ respectively represent the preceding source word, the following source word and a translation candidate of a VTG.

6.1 Evaluation Methodology

For evaluation, we used human judgments of the modified and original MT. We did not have reference translations for the data used by our question-answering system and thus could not use metrics such as TER or BLEU. Moreover, at best, the TER or BLEU score would increase by a small amount, and that is only if we select the same main verb in the same position as the reference. Critically, we also know that a missing main verb can cause major problems with comprehension. Thus, readers could better determine if the modified sentence better captured the meaning of the source sentence. We also evaluated relevance of a sentence to a query before and after modification.

We recruited 13 Chinese native speakers who are also proficient in English to judge MT quality. Native English speakers cannot tell which translation is better since they do not understand the meaning of the original Chinese. To judge relevance to the query, we used native English speakers.

Each modified sentence was evaluated by three people. They were shown the Chinese sentence and two translations, the original MT and the modified one. Evaluators did not know which MT sentence was modified. They were asked to decide which sentence is a better translation, after reading the Chinese sentence. An evaluator also had the option of answering