Material from the workshop held at the LREC 2002 Conference, 27 May 2002

Machine Translation Evaluation:
Human Evaluators Meet Automated Metrics

 


Workshop material

Workbook of the workshop
Download the PDF version (250 kb) or the PS.GZ version (387 kb). This document was distributed at the workshop, and is also available on the LREC 2002 CD-ROM.
Talks presented at the workshop
Andrei Popescu-Belis: workshop introduction and conclusions (PS.GZ | PDF)
Florence Reeder: human-based evaluation metrics (PS.GZ | PDF)
George Doddington: the NIST automated measure and its relation to IBM's BLEU (PS.GZ | PDF)
Evaluation reports presented at the workshop
Bonnie Dorr (PS.GZ | PDF)
Cristina Vertan (not yet available)
Marianne Dabbadie & Widad Mustafa El Hadi (PS.GZ | PDF)
George Doddington (PS.GZ | PDF)
Florence Reeder (see end of previous talk)
Eva Forsbom (PS.GZ | PDF)
Evaluation reports that could not be presented
Michelle Vanni (PS.GZ | PDF)
Mairead McCarthy et al. (PS.GZ | PDF)

Test data

The test data consist of two sets of translations of two articles originally written in French. We provide below, in text format, the two original articles, a reference translation for those who do not speak French (certainly not "the perfect translation", but close enough to the original), and a dozen translations of varying quality. You can download a ZIP archive with all the text files, as well as a single text file containing all the individual test files (useful for printing), in either RTF or PS.GZ format.

"Children and drugs": excerpts from the brochure "Prévenir ses enfants des problèmes de drogue", Institut Suisse de Prévention de l'Alcoolisme et Autres Toxicomanies (ISPA), 24 p., 1999. Available for free; order at http://www.sfa-ispa.ch.

"Taliban and Women": Micheline Centlivres-Demont, "Hommes combattants, femmes discrètes : aspects des résistances subalternes dans le conflit et l'exil afghan", pp. 169-182 (excerpt at p. 178). In Fenneke Reysoo (ed.), "Hommes armés, femmes aguerries : rapports de genre en situations de conflit armé", DDC/Unesco/IUED, Geneva, 2001, 250 p.
Proceedings of a colloquium held at the Institut Universitaire d'Études du Développement (IUED), Geneva, 23-24 January 2001.
Available for free from the IUED press service or on the IUED website.

Online references to evaluation metrics

Three reports gather a large number of proposed metrics, mainly human-based:
  • Georges Van Slype (1979). Critical Study of Methods for Evaluating the Quality of Machine Translation: Final Report. Bureau Marcel van Dijk / European Commission, Brussels.
  • K. Falkedal (1994). Evaluation Methods for Machine Translation Systems: An Historical Overview and Critical Account. ISSCO draft report, University of Geneva, Geneva.
    Not yet available online.
  • Proceedings of the workshop "MT Evaluation: Who Did What To Whom", held in conjunction with the Machine Translation Summit VIII, 22 September 2001, Santiago de Compostela, Spain.
Two automatic or automatable metrics:
  • BLEU: a Method for Automatic Evaluation of Machine Translation. Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. IBM Research Report RC22176, 2001.
    Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that cannot be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.
  • An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research. Sonja Niessen, Franz Josef Och, Gregor Leusch, Hermann Ney. Proc. 2nd International Conference on Language Resources and Evaluation (LREC 2000), pp. 39-45, Athens, Greece, May 2000.
    In this paper we present a tool for the evaluation of translation quality. First, the typical requirements of such a tool in the framework of machine translation (MT) research are discussed. We define evaluation criteria which are more adequate than pure edit distance and we describe how the measurement along these quality criteria is performed semi-automatically in a fast, convenient and above all consistent way using our tool and the corresponding graphical user interface.
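To illustrate the first metric above: the core of BLEU is a geometric mean of modified (clipped) n-gram precisions, scaled by a brevity penalty. The following is a simplified, single-reference sketch written for this page, not the authors' implementation; the function names are ours, and real BLEU uses multiple references and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a correct word cannot inflate the score.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:  # avoid log(0); real BLEU smooths instead
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0; a candidate sharing no words with the reference scores 0.0; everything else falls in between.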
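The "pure edit distance" baseline that the second paper improves upon is, at the word level, the word error rate (WER): the minimum number of word insertions, deletions, and substitutions turning the hypothesis into the reference, normalised by reference length. A minimal dynamic-programming sketch, assuming whitespace tokenisation:

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance between hypothesis and reference,
    normalised by the reference length (the classic WER baseline)."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

One substitution in a three-word reference gives a WER of 1/3; an empty hypothesis gives 1.0.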

This web page, the test data and the metrics have been gathered by Andrei Popescu-Belis together with the other organizers of the workshop.
Last updated on June 6, 2002