Machine Translation Summit IX
Workshop on Machine Translation Evaluation

Towards Systematizing MT Evaluation

Saturday, September 27, 2003
9:00 am - 5:00 pm
New Orleans, Louisiana, USA

Final Program

Note: regular talks are 20' long plus 20' for discussion and 5' for speaker change.

Time | Title of presentation | Author(s)
9:00-9:15 | Introduction to the workshop: Systematizing MT Evaluation | Organizers
9:15-10:00 | Invited talk: Cross-domain Study of N-gram Co-occurrence Metrics | Chin-Yew Lin (USC/ISI, USA)
10:00-10:30 | Break
10:30-11:15 | Granularity in MT Evaluation | Florence Reeder (MITRE, USA) and John S. White (Northrop-Grumman, USA)
11:15-12:00 | Training a Super Model Look-Alike: Featuring Edit Distance, N-Gram Occurrence, and One Reference Translation | Eva Forsbom (Uppsala University, Sweden)
12:00-13:30 | Lunch break
13:30-14:15 | Task-based MT Evaluation: Tackling Software, Experimental Design, & Statistical Models | Calandra Tate (University of Maryland, USA), Sooyon Lee (ARTI, Inc., USA), and Clare R. Voss (Army Research Laboratory, USA)
14:15-15:00 | Evaluation Techniques Applied to Domain Tuning of MT Lexicons | Necip Fazil Ayan, Bonnie J. Dorr, and Okan Kolak (University of Maryland, USA)
15:00-15:30 | Break
15:30-16:15 | Considerations of Methodology and Human Factors in Rating a Suite of Translated Sentences | Leslie Barrett (Transclick, Inc., USA)
16:15-17:00 | Pragmatics-based Translation and MT Evaluation | David Farwell and Stephen Helmreich (New Mexico State University, USA)


Estimating the quality of any machine-translation system accurately is only possible if the evaluation methodology is robust and systematic. The Evaluation Work Group of the NSF- and EU-funded ISLE project has created a taxonomy that relates situations and measures for a variety of MT applications. The Framework for MT Evaluation in ISLE (FEMTI) is now available online.

Matching these measures correctly with their appropriate evaluation tasks, however, is an area that needs further attention. For example, what effect do user needs have on the functionality characteristics specified in the FEMTI guidelines? To what extent are there unseen relationships in the branches of the taxonomy? How can we judge when a given evaluation measure is appropriate? Issues that bear on these questions include the automation of MT evaluation, the extension to MT applications such as automated speech translation, and the evaluation of the very training corpora that an MT system relies on to improve output quality.

This workshop welcomes papers for 30-minute presentations on the comparison between MT evaluation measures, studies of the behavior of individual measures (i.e., meta-evaluation), new uses for measures, analysis of MT evaluation tasks with respect to measures, and related topics on this theme. We solicit submissions that address some of the following issues; however, any other topic related to MT testing and evaluation is also acceptable.

Machine Translation Evaluation Measures
  • Use of existing measures in the ISLE hierarchy (FEMTI guidelines)
  • New measures and their uses
  • Matching evaluation requirements (e.g., translation tasks, user profiles) with measures
  • Effects of combining measures
Evaluation Measures and Languages
  • Is a metric's effectiveness language independent?
  • Counting grammatical features for evaluation
Evaluation and Domains
  • Measures for spoken language translation
  • Domain-specific evaluation techniques
  • Using measures to evaluate the quality of a training corpus for a given task
Automation vs. Human Testing
  • Which measures are suitable for automation?
  • Human/machine scoring comparisons
  • Human tester agreement: which measures fare best?

Program Committee

  • Bonnie Dorr (University of Maryland)
  • Eduard Hovy (Information Sciences Institute, University of Southern California)
  • Maghi King (ISSCO/TIM/ETI, University of Geneva)
  • Bente Maegaard (Center for Sprogteknologi, Copenhagen, Denmark)
  • Keith Miller (MITRE Corp.)
  • Martha Palmer (University of Pennsylvania)
  • Ted Pedersen (University of Minnesota, Duluth)
  • Andrei Popescu-Belis (ISSCO/TIM/ETI, University of Geneva)
  • Florence Reeder (MITRE Corp.)
  • Nancy Underwood (ISSCO/TIM/ETI, University of Geneva)
  • Michelle Vanni (Army Research Laboratory, USA)

Andrei Popescu-Belis
