The EAGLES 7-step recipe
EAGLES Evaluation Working Group

April 1999

Introduction

The overall process of evaluation is the same whether we are comparing several systems or evaluating a single candidate system. The ultimate question is whether the system fits what the customer of the evaluation wants or needs. In practice such requirements may not be set in stone before the evaluation starts, and carrying out the evaluation may cause us to re-think them (particularly when no available system fits all the requirements, or when the type of system offers functionality the evaluator was not aware of). Nevertheless, there will always be some idea of the requirements on a system before the evaluation begins. Given these requirements, we must also have some way of judging whether a candidate system meets them. General requirements on the system are broken down into requirements on individual system attributes, and for each of these attributes a measure, and a method for obtaining that measure, are defined. Each attribute is then measured, and the results are compared with the original requirements to evaluate how well the system fulfills them.
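To fix ideas, the loop just described (attributes, measures, and comparison with the requirements) can be sketched in a few lines of Python. Everything here (the Requirement record, the evaluate function, the example values) is a hypothetical illustration, not part of the EAGLES framework itself.

    # Minimal sketch of the requirements-to-measurement loop described above.
    # All names and example values are invented for illustration.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Requirement:
        attribute: str                       # system attribute under test
        measure: Callable[[], float]         # method for obtaining the measure
        acceptable: Callable[[float], bool]  # does the value meet the requirement?

    def evaluate(requirements: list[Requirement]) -> dict[str, bool]:
        """Measure each attribute and compare the result with the requirement."""
        return {r.attribute: r.acceptable(r.measure()) for r in requirements}

    # Example: one documentation-derived attribute and one tested attribute.
    requirements = [
        Requirement("max database size", lambda: 100_000, lambda v: v >= 100_000),
        Requirement("concurrent users",  lambda: 5,       lambda v: v >= 3),
    ]
    print(evaluate(requirements))  # {'max database size': True, 'concurrent users': True}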

In this short document we present a brief overview of the 7 major steps necessary to carry out a successful evaluation of language technology systems or components. For more detailed discussion and exemplification see the EAGLES report.

The 7-step Recipe

1. Why is the evaluation being done?

2. Elaborate a task model

3. Define top level quality characteristics

4. Produce detailed requirements for the system under evaluation, on the basis of steps 2 and 3

5. Devise the metrics to be applied to the system for the requirements produced under step 4

6. Design the execution of the evaluation

7. Execute the evaluation




An Informal Example

Here we present a simplified, informal example of a fictitious evaluation in which a translation agency is considering acquiring a terminology management tool, in order to gain better efficiency and consistency in the terminology it translates. Following the first 5 steps of the recipe might lead to the following sorts of answers, although in real life the situation would be more complex and the resulting requirements much more detailed.

1. Why is the evaluation being done?

2. Elaborate a Task Model

3. Define top level quality characteristics

4. Produce detailed requirements

5. Devise metrics to be applied to the system

Some metrics (measures and methods) will involve simple inspection of the documentation accompanying a tool, for example, the character sets which are supported or the maximum size of a term database. The acceptable values for the language and size measures are already determined in the detailed requirements.
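Such inspection-based checks can be written down directly as comparisons between documented values and required ones. The sketch below is hypothetical; the spec values shown are invented for illustration.

    # Hypothetical spec sheet transcribed from the tool's documentation.
    documented = {"character sets": {"Latin-1", "UTF-8"}, "max terms": 250_000}

    # Acceptable values, fixed in advance by the detailed requirements.
    required = {"character sets": {"Latin-1"}, "max terms": 100_000}

    checks = {
        "character sets": required["character sets"] <= documented["character sets"],
        "max terms": documented["max terms"] >= required["max terms"],
    }
    print(checks)  # {'character sets': True, 'max terms': True}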

In other cases one should not rely on the manufacturer's own description. For example, checking how many people can access the tool at once, and what they are allowed to do, requires experimentation with the tool itself. A good score for the number of people who can efficiently work on the database at one time would be 8 (since this is the total number of translators employed); a score of less than 3 would be unacceptable.
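To make this concrete, the concurrency requirement could be expressed as a small scoring rule. This is only an illustrative sketch: the function and the good/acceptable/unacceptable scale are invented here; only the thresholds of 8 and 3 users come from the requirements above.

    def rate_concurrency(users: int) -> str:
        """Map a measured number of simultaneous users to a verdict.

        Thresholds from the requirements: 8 (all translators on staff)
        is a good score; fewer than 3 is unacceptable.
        """
        if users >= 8:
            return "good"
        if users >= 3:
            return "acceptable"
        return "unacceptable"

    for measured in (2, 5, 8):
        print(measured, "users:", rate_concurrency(measured))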

Other characteristics, such as speed, must be split up into smaller measurable sub-attributes, and a number of different factors should be taken into consideration. The time it takes to retrieve a term may be affected by the size of the database and/or by the number of other users working on the system at the same time, and we want to measure these effects as well. Thus we get different measures such as the following (a rough measurement sketch is given after the list):

a. average time to retrieve a term from a 100,000-term database (single user)

b. average time to retrieve a term from a 100,000-term database (3 users)

c. average time to retrieve a term from a 100,000-term database (5 users)

d. average time to save a term in a 100,000-term database (single user)

e. average time to save a term in a 100,000-term database (3 users)

f. average time to save a term in a 100,000-term database (5 users)

etc.
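Measures (a) to (c) could be collected with a small timing harness along the following lines. This is a minimal sketch, not a real benchmark: lookup_term is a hypothetical stand-in for the tool's retrieval call, and the term list, user counts and simulated delay are all invented for illustration.

    import time
    from concurrent.futures import ThreadPoolExecutor
    from statistics import mean

    def lookup_term(term: str) -> str:
        """Hypothetical stand-in for the tool's term-retrieval call."""
        time.sleep(0.001)  # simulate database access
        return term.upper()

    def avg_retrieval_time(terms: list[str], users: int) -> float:
        """Average seconds per lookup with `users` simulated concurrent users."""
        def timed(term: str) -> float:
            start = time.perf_counter()
            lookup_term(term)
            return time.perf_counter() - start
        with ThreadPoolExecutor(max_workers=users) as pool:
            return mean(pool.map(timed, terms))

    terms = ["term%d" % i for i in range(100)]
    for users in (1, 3, 5):  # the user counts from measures (a) to (c)
        print(users, "user(s):", round(avg_retrieval_time(terms, users), 4), "s")

The same harness, with the lookup replaced by a save operation, would yield measures (d) to (f).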


Sandra Manzi Last modified: Tue Feb 22 15:11:13 MET 2000