The EAGLES 7-step recipe
EAGLES Evaluation Working Group
April 1999
Introduction
The overall process of evaluation is the same whether we are comparing different systems or evaluating a single candidate system. The ultimate question is whether the system fits what the customer of the evaluation wants or needs. In practice such requirements may not be set in stone before the evaluation starts, and carrying out the evaluation may cause us to re-think them (particularly when no available system fits all the requirements, or when the type of system provides extra functionality of which the evaluator was not aware). Nevertheless, there will always be some idea of the requirements on a system before the evaluation begins. Given these requirements, we must also have some way of judging whether a candidate system meets them. General requirements on the system are broken down into requirements on individual system attributes, and for each of these attributes a measure and a method for obtaining that measure are defined. Each of these attributes is then measured, and the results are compared with the original requirements to evaluate how well the system fulfills them.
In this short document we present a brief overview of the 7
major steps necessary to carry out a successful evaluation of
language technology systems or components. For more detailed
discussion and exemplification see the EAGLES report.
The 7-step Recipe
1. Why is the evaluation being done?
- What is the purpose of the evaluation? Do all parties involved
have the same understanding of the purpose?
- What exactly is being evaluated? Is it a system or a system
component? A system in isolation or a system in a specific context
of use? Where are the boundaries of the system?
2. Elaborate a task model
- Identify all relevant roles and agents
- What is the system going to be used for?
- Who will use it? What will they do with it? What are these
people like?
3. Define top level quality characteristics
- What features of the system need to be evaluated? Are they all
equally important?
4. Produce detailed requirements for the system under
evaluation, on the basis of 2 and 3
- For each feature which has been identified as important, can a valid and reliable way be found of measuring how the object being evaluated performs with respect to that feature? If not, the feature has to be broken down, in a valid way, into sub-attributes which are measurable; this process is repeated until all of the attributes reached are measurable.
5. Devise the metrics to be applied to the system for the
requirements produced under 4.
- Both measure and method for obtaining that measure have to be
defined for each attribute.
- For each measurable attribute, what will count as a good score, a satisfactory score or an unsatisfactory score, given the task model (2)? Where are the cut-off points?
- Usually, an attribute has more than one sub-attribute. How are the values of the different sub-attributes combined into a value for the mother node, so as to reflect their relative importance (again given the task model)? A small illustrative sketch of such a combination follows the recipe.
6. Design the execution of the evaluation:
- Develop test materials to support the testing of the
object.
- Who will actually carry out the different measurements? When?
In what circumstances? What form will the end result take?
7. Execute the evaluation:
- Make the measurements.
- Compare with the previously determined satisfaction
ratings.
- Summarize the results in an evaluation report, cf. point
1.
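
Steps 4 and 5 can be pictured as building and then evaluating a small attribute tree: top-level characteristics are broken down until every leaf is measurable, and the leaf scores are combined back up the tree according to their relative importance. The Python sketch below is purely illustrative; the attribute names, weights and scores are invented for the purpose and would in practice come from the task model.

from dataclasses import dataclass, field

@dataclass
class Attribute:
    # A node in the quality model: either directly measurable (it carries
    # a score) or broken down into weighted sub-attributes (step 4).
    name: str
    weight: float = 1.0    # relative importance under its parent node
    score: float = 0.0     # normalised score in 0.0-1.0, if measurable
    subs: list = field(default_factory=list)

    def combined_score(self):
        # Weighted average of the sub-attribute scores (step 5).
        if not self.subs:
            return self.score
        total = sum(a.weight for a in self.subs)
        return sum(a.combined_score() * a.weight for a in self.subs) / total

# Hypothetical decomposition of a "Speed" characteristic, with look-up
# weighted more heavily than updating.
speed = Attribute("Speed", subs=[
    Attribute("average look-up time", weight=0.7, score=0.9),
    Attribute("average updating time", weight=0.3, score=0.6),
])
print(speed.combined_score())   # 0.9*0.7 + 0.6*0.3 = 0.81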
An Informal Example
Here we present a rather simplified, informal example of a fictitious evaluation, for a case where a translation agency is considering acquiring a terminology management tool in order to gain better efficiency and consistency in its terminology work. Following the first 5 steps outlined in the recipe might lead to the following sorts of answers, although in real life the situation would be more complex and the resulting requirements much more detailed.
1. Why is the evaluation being done?
- What is the purpose of the evaluation? To choose the most suitable terminology management tool for both the translators and the terminologists to use. Whilst the manager is looking for efficiency and cost savings, the individual translators and terminologists are hoping for a way to make their work more satisfying.
- What exactly is being evaluated? Terminology
management tools which can be accessed via a network.
2. Elaborate a Task Model
- What is the system going to be used for?
Looking
up terms during translation, storing newly translated terms, and
ensuring consistency within and across translations.
- Who will use it? What will they do with it? What are these people like?
Technical translators, with an average of seven years' experience in translating technical texts from English into French, Spanish and Japanese, will use it during translation to look up terms and their translations. The terminologist will use it to build up and organise terminology and to validate the accuracy and consistency of the terminology available to the translators.
3. Define top level quality characteristics
- What features of the system need to be evaluated? Are they all equally important?
- Languages: The tool must be able to support all the relevant
languages, otherwise it will be of no use.
- Access: How many people can access the tool at one time? What can they do with it?
- Size: How many terms (and their translations) can be
stored?
- Consistency: Does the tool have facilities for ensuring that
for each term only one translation per target language is
entered?
- Speed: How fast are terminology look-up and updating? Whilst look-up and updating should not take an unreasonable amount of time, this characteristic may not be as important as the preceding ones.
4. Produce detailed requirements
- Languages: The tool must be able to support the character sets of all of English, French, Spanish and Japanese.
- Access: The tool must allow for at least 3 translators to look up terms at one time. It must not allow translators to automatically update, and thus overwrite, translations of existing terms which have not been approved by the terminologist. The tool should allow for different types of access for different users.
- Size: The agency wants to be able to store and access up to a million terms in the next five years.
- Consistency: The tool should have facilities for ensuring that
for each term only one translation per target language is stored.
The tool should allow for completely new terms to be added during
translation and marked as such to allow the terminologist to
approve them.
- Speed: Terminology look-up and updating must be quicker than the current procedure using index cards. However, there could be a trade-off here: if the improvement in consistency is very great (thus reducing the average post-editing time), then speed of look-up and updating may be less important. This is one of the attributes which needs to be split into measurable sub-attributes for the two different processes (see below).
5. Devise metrics to be applied to the system
Some metrics (measures and methods) will involve simple
inspection of the documentation accompanying a tool, for example,
the character sets which are supported or the maximum size of a
term database. The acceptable values for the language and size
measures are already determined in the detailed requirements.
In other cases one should not rely on the manufacturer's own description. So, for example, checking how many people can access the tool at once, and what they are allowed to do, requires experimentation with the tool itself. A good score for the number of people who can efficiently work on the database at one time would be 8 (since this is the total number of translators employed). A score of less than 3 would be unacceptable.
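
In terms of step 5 of the recipe, this amounts to fixing cut-off points that turn a raw measurement into a satisfaction rating. A minimal sketch in Python, using only the figures assumed in this example:

def rate_concurrent_users(measured_users):
    # Cut-off points from the example: the agency employs 8 translators
    # in total, and fewer than 3 concurrent users is unacceptable.
    if measured_users >= 8:
        return "good"
    if measured_users >= 3:
        return "satisfactory"
    return "unsatisfactory"

print(rate_concurrent_users(5))   # satisfactory
print(rate_concurrent_users(2))   # unsatisfactory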
Other characteristics, such as speed, must be split up into smaller measurable sub-attributes, and involve a number of different factors which should be taken into consideration. The time it takes to retrieve a term may be affected by the size of the database and/or the number of other users working on the system at the same time, and we want to measure these effects as well. Thus we get different measures such as:
a. average time to retrieve a term from a 100,000-term database (single user)
b. average time to retrieve a term from a 100,000-term database (3 users)
c. average time to retrieve a term from a 100,000-term database (5 users)
a. average time to save a term in a 100,000-term database (single user)
b. average time to save a term in a 100,000-term database (3 users)
c. average time to save a term in a 100,000-term database (5 users)
etc.
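
How such measures might be obtained is sketched below. The lookup_term argument is a hypothetical stand-in for whatever query interface the tool under evaluation exposes; only the timing scaffolding is meant literally, and the same pattern would apply to saving a term.

import time

def average_lookup_time(lookup_term, terms, repetitions=3):
    # Average wall-clock time in seconds to retrieve each term, e.g. run
    # against a 100,000-term database with 1, 3 or 5 concurrent users to
    # obtain measures a, b and c above.
    timings = []
    for term in terms:
        for _ in range(repetitions):
            start = time.perf_counter()
            lookup_term(term)                     # hypothetical tool call
            timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)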