Material from the workshop held at the LREC 2002 Conference, 27 May 2002
Machine Translation Evaluation:
Human Evaluators Meet Automated Metrics
- Workbook of the workshop
- Download the PDF version (250 kb) or the
PS.GZ version (387 kb). This document was
distributed at the workshop, and is also available on the LREC 2002 CD-ROM.
- Talks presented at the workshop
- Andrei Popescu-Belis: workshop introduction and conclusions (PS.GZ | PDF)
- Florence Reeder: human-based evaluation metrics (PS.GZ | PDF)
- George Doddington: the NIST automated measure and its relation to IBM's BLEU (PS.GZ | PDF)
- Evaluation reports presented at the workshop
- Bonnie Dorr (PS.GZ | PDF)
- Cristina Vertan (not yet available)
- Marianne Dabbadie & Widad Mustafa El Hadi (PS.GZ | PDF)
- George Doddington (PS.GZ | PDF)
- Florence Reeder (see end of previous talk)
- Eva Forsbom (PS.GZ | PDF)
- Evaluation reports that could not be presented
- Michelle Vanni (PS.GZ | PDF)
- Mairead McCarthy et al. (PS.GZ | PDF)
The test data consists of two sets of translations of two articles,
originally in French. We provide below, in text format, the two
original articles, a reference translation for those who do not
speak French (definitely not "the perfect translation", but close
enough to the text), and a dozen translations of variable quality.
You can download a ZIP archive with all text files, as well as a single file containing all the individual test files
(useful for printing), in either RTF
or PS.GZ format.
"Children and drugs"
Excerpts from the brochure "Prévenir ses enfants des problèmes de
drogue", Institut Suisse de Prévention de l'Alcoolisme et Autres
Toxicomanies (ISPA), 24 p., 1999. Available for free, order at
"Taliban and Women"
Micheline Centlivres-Demont, "Hommes combattants, femmes discrètes :
aspects des résistances subalternes dans le conflit et l'exil afghan"
(p.169-182, excerpt at p. 178). In "Hommes armés, femmes aguerries :
rapports de genre en situations de conflit armé", Fenneke Reysoo, editor,
DDC/Unesco/IUED, Geneva, 2001, 250 p.
Proceedings of a colloquium held at the Institut Universitaire des Études
du Développement, Geneva, 23-24 January 2001.
Available freely at the IUED's press service or on the IUED website.
Online references to evaluation metrics
Three reports gather a great number of proposed metrics, mainly human
- Georges Van Slype (1979) - Critical Study of Methods for
Evaluating the Quality of Machine Translation. Final Report, Bureau Marcel van Dijk / European Commission, Brussels.
- Falkedal, K. 1994. Evaluation methods for machine
translation systems: An historical overview and critical
account, ISSCO draft report, University of Geneva, Geneva.
Not available yet
- Proceedings of the workshop "MT Evaluation: Who Did What To Whom".
In conjunction with Machine Translation Summit VIII, Saturday, September 22nd, 2001, Santiago de Compostela, Spain.
Two automatic/automatable metrics
- Bleu: a Method for Automatic Evaluation of Machine Translation. Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. Published as IBM Report RC22176 in 2001.
Human evaluations of machine translation are extensive but
expensive. Human evaluations can take months to finish and involve
human labor that can not be reused. We propose a method of automatic
machine translation evaluation that is quick, inexpensive, and
language-independent, that correlates highly with human evaluation,
and that has little marginal cost per run. We present this method as
an automated understudy to skilled human judges which substitutes for
them when there is need for quick or frequent evaluations.
- An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research. Sonja Niessen, Franz Josef Och, Gregor Leusch, Hermann Ney. Proc. 2nd International Conference on Language Resources and Evaluation, pp. 39-45, Athens, Greece, May 2000.
In this paper we present a tool for the evaluation of translation
quality. First, the typical requirements of such a tool in the
framework of machine translation (MT) research are discussed. We
define evaluation criteria which are more adequate than pure edit
distance and we describe how the measurement along these quality
criteria is performed semi-automatically in a fast, convenient and
above all consistent way using our tool and the corresponding
graphical user interface.
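For readers unfamiliar with the "pure edit distance" baseline mentioned in the abstract above, the following is a minimal sketch of a word-level edit distance between a candidate translation and a reference. It is an illustration only and is not the tool described in the paper; the function names `word_edit_distance` and `word_error_rate`, and the choice of whitespace tokenization, are assumptions made for this example.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Count word insertions, deletions and substitutions needed to turn
    the hypothesis into the reference (Levenshtein distance over words).
    Tokenization by whitespace is an assumption of this sketch."""
    hyp = hypothesis.split()
    ref = reference.split()
    # previous[j]: distance between the hypothesis words seen so far
    # (before the current one) and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]  # matching i hypothesis words against an empty reference
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def word_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by the reference length in words."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis, reference) / ref_len
```

Applied to the test data above, each candidate translation could be scored against the reference translation in this way. A score of this kind only counts surface mismatches, which is one reason the abstract argues for criteria that go beyond pure edit distance.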
This web page, the test data and the metrics have been gathered by
together with the other organizers of the workshop.
Last updated on June 6, 2002