
Benchmarking translation memories

Whatever the system, the choice of attributes to benchmark should be based on two practical considerations:

The following sections apply these criteria to translation memories.


The bottom line of all translation aids is translation cost, which in turn depends on translation speed and translation quality. Both attributes are usually essential but inversely related; their relative weights may vary with the type of translation.

Translation speed in turn may depend not only on the speed of an individual translator but also on the throughput of the whole production line, including pre- and post-editing, layout, terminology management, accounting, and so on. A system that allows a number of intercommunicating translators to work in parallel on the same job can achieve higher throughput than improvements in individual translation speed (cf. parallel processing vs. speedup of individual processors).

Thus we should describe different tasks and working setups for translators, and derive desiderata for a translation memory from them, and benchmarks from those desiderata. What is it that a translation memory is good for? The answers, or at least relative weights of different answers, may differ depending on text type and translation task.

For instance, there is a distinction between retranslation and translation of new material. When a document is retranslated (a new version of a manual, an instance of a fixed agreement text), large portions of the text may stay unchanged, or only constants (names, dates, codes, numbers) may change. The translation memory will then contain the previous version(s) of the document, and there will be a high percentage of exact or near-100% matches. Parts of the document may be translated automatically, with only the changed segments presented to the translator.

At the other extreme, a new text similar in style, terminology and phraseology to previous translations is to be translated. The percentage of exact matches to the previous translations may be low, but there may be significant overlap in style (terminology and phraseology) that should be exploited. Either the system should have an intelligent fuzzy matching algorithm, or the search for useful matches should be left to the user.

A text may exhibit significant repetition internally. In this case, the translation memory should be updated as the translation proceeds so as to exploit the autocorrelation and ensure consistency throughout the document.

To sum up, a translator needs a translation memory to:

What are the most important properties of a translation memory given these objectives? Note that some properties are interdependent and can be inversely related (e.g. hit rate and speed of retrieval). The properties can be divided into off-line and online properties, that is, properties of the population and of the use of a translation memory.

In general, much of the work needed to ensure useful matches at translation time can be done off-line, insofar as it depends on materials available ahead of translation (model texts, source text).

For example, the size of the online translation memory can be kept proportional to the size of the text being translated, by filtering a larger translation memory (or several of these) against the text to be translated, so as to obtain a new translation memory that only contains the matches found (possibly with enough surrounding context to help gauge the relevance of the match).
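The filtering step described above can be sketched as follows. This is a minimal illustration only: it assumes that a translation memory is simply a list of (source, target) segment pairs and uses the ratio from Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure. The function names, the 0.7 threshold and the sample segments are hypothetical and are not taken from any actual system.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity in [0, 1]; a real system would use its own measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def filter_memory(memory, text_segments, threshold=0.7):
    """Keep only the memory entries that match some segment of the text to translate."""
    kept = []
    for source, target in memory:
        if any(similarity(source, seg) >= threshold for seg in text_segments):
            kept.append((source, target))
    return kept

# Illustrative data: a larger English-Finnish memory and a new source text.
big_memory = [
    ("Press the red button.", "Paina punaista painiketta."),
    ("Close the cover.", "Sulje kansi."),
    ("The warranty is valid for two years.", "Takuu on voimassa kaksi vuotta."),
]
new_text = ["Press the green button.", "Close the cover."]

# The text-specific memory contains only the entries relevant to new_text.
small_memory = filter_memory(big_memory, new_text)
```

The resulting memory grows with the text to be translated rather than with the size of the archive. Surrounding context for each match is left out of the sketch.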

Autocorrelation forms an exception to this, i.e. the case where the translation memory is updated as the translation proceeds and used immediately in translating the rest of the text.
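A rough sketch of this online case, under the simplifying assumption that only verbatim repetitions are reused: the memory starts empty, is filled as the job proceeds, and is consulted for every later segment. The function `translate_with_memory` and the placeholder `translate_fn` (standing in for the human translator) are hypothetical.

```python
def translate_with_memory(segments, translate_fn):
    """Translate segments in order, reusing translations of repeated segments."""
    memory = {}            # source segment -> translation, grows during the job
    output = []
    reused = 0
    for seg in segments:
        if seg in memory:
            reused += 1    # internal repetition: reuse the earlier translation
        else:
            memory[seg] = translate_fn(seg)   # new segment: ask the translator
        output.append(memory[seg])
    return output, reused

doc = ["Open the lid.", "Remove the filter.", "Open the lid.", "Open the lid."]

# The lambda is a placeholder for real translation work.
translations, reused = translate_with_memory(doc, lambda s: "<" + s + ">")
```

Because repeated segments always receive the stored translation, consistency within the document follows directly from the update policy.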

Online properties

Size (capacity of translation memory).

This has an effect on hit rate and speed. The optimum size is likely to depend on text and translation type. The less repetition there is, the more text is needed to obtain an acceptable hit rate. (Again, filtering may help.)

Speed (retrieval time).

This depends on size and on the match algorithm. In translation in general, the raw speed of translation aids is strongly dominated by the time spent by the user in screening and editing translation proposals. What may be of interest is the behavior of the system (speed, hit rate) as the size of the memory grows (to determine optimum translation memory size).

Hit rate (number of matches).

This depends on size, segmentation, match algorithm, and match evaluation algorithm. There is an optimum to shoot for here rather than a maximum. In general, translators prefer few high quality matches to many matches of doubtful use. The reason is simple; screening matches by eye is slow, tiring, and error prone, and editing around a bad match may be slower than writing a new translation from scratch.
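One way to aim for an optimum rather than a maximum is to combine a minimum score with a cap on the number of proposals. The sketch below assumes this policy; the similarity measure (`difflib.SequenceMatcher`), the threshold of 0.75 and the cap of three proposals are illustrative choices, not values recommended by the text.

```python
from difflib import SequenceMatcher

def best_matches(query, memory, min_score=0.75, max_proposals=3):
    """Return at most max_proposals entries scoring at least min_score, best first."""
    scored = []
    for source, target in memory:
        score = SequenceMatcher(None, query, source).ratio()
        if score >= min_score:
            scored.append((score, source, target))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:max_proposals]

# Illustrative memory; T1..T3 stand for stored translations.
memory = [
    ("Close the cover.", "T1"),
    ("Close the cover now.", "T2"),
    ("Open the box.", "T3"),
]
proposals = best_matches("Close the cover.", memory)
```

Raising `min_score` lowers the hit rate but spares the translator from screening doubtful matches, which is the trade-off described above.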

Off-line properties

Text analysis methods.

We would like to stress the quality characteristic of customizability or adaptability. Systems currently on the market tend to take an overly narrow view of text analysis: built-in parsers for segmentation, built-in match algorithms. In at least one case, no off-the-shelf translation memory was found acceptable because its segmentation could not manage sentences of over 100 words and did not provide any easy way to modify the definition of segments. The SGML-style approach, with a generic parser that accepts user-defined grammars, should be pushed as a standard here as well.

Another aspect is the depth of linguistic analysis. This cuts two ways. On the one hand, systems that depend on linguistic analysis become language-bound. For instance, TM/2 is not easily extended to a language like Finnish, with its rich morphology. On the other hand, better (more human-like) fuzzy matching algorithms of the future are bound to identify units that are meaningful for translation (such as shared words, terminology, phraseology, or similar grammar).

Segmentation and alignment: success rate

These affect hit rate and usefulness of matches. What should count as "correct" segmentation and alignment depends on the usefulness of the resulting matches to the translator. That can depend on text and translation type. Given a particular definition of segment, we can benchmark whether a translation memory program segments and aligns a text and its translation according to the definition (accuracy). Or, we can try to measure the usefulness of a given method of segmentation or alignment for translation (suitability). See section Evaluation procedure for further discussion of these types of benchmark.

Initial (raw) import into translation memory

This measures portability. It is of interest to translators who have a large quantity of translated material at hand before acquiring a system, or to translation companies that work with a variety of systems. Here, EAGLES could propose or endorse a standard format for aligned texts (e.g. one defined by the Text Encoding Initiative, TEI).

Translation memory combinatorics

These interact with size and speed too, because good off-line methods for constructing text-specific translation memories can reduce size and improve hit rate during translation. Inverting and composing translation memories may not be as practical as it seems (in general, translation relations are neither symmetric nor transitive).
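To make the caveat concrete, here is a small sketch of inversion and composition, again assuming that a memory is a plain list of (source, target) pairs and that composition goes through exactly equal pivot segments. The functions and the sample pairs are illustrative only.

```python
def invert(memory):
    """Swap source and target of every pair."""
    return [(target, source) for source, target in memory]

def compose(first, second):
    """Chain an A->B memory and a B->C memory into A->C via identical B segments."""
    by_source = {}
    for b, c in second:
        by_source.setdefault(b, []).append(c)
    return [(a, c) for a, b in first for c in by_source.get(b, [])]

# Two different English segments share one Finnish translation.
en_fi = [
    ("You are welcome.", "Ole hyvä."),
    ("Here you are.", "Ole hyvä."),
]
fi_en = invert(en_fi)

# The inverted memory offers two competing targets for one source segment.
candidates = {target for source, target in fi_en if source == "Ole hyvä."}

# Composition through Finnish as a pivot (illustrative Swedish target).
fi_sv = [("Ole hyvä.", "Varsågod.")]
en_sv = compose(en_fi, fi_sv)
```

The inverted memory cannot choose between its two candidates without context, and composition inherits every ambiguity of the pivot language, which is why such derived memories need human checking.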

Export of translation memory

This may be of particular interest for loosely knit work groups using various types of tool or environment and communicating irregularly and/or via slow connections (modem, fax, diskette, paper).

Specifying benchmarks

Starting from the feasibility angle, to define a benchmark we have to be able to provide:

Attributes have been discussed in the previous section. Some measures one can think of are:

The inputs of the benchmark can be any of the following: raw input, a test corpus, or a test suite.

The differences are these: using raw input, the tester has to check all instances of the input-output relation, to ensure that when the input is of a certain type, then the output is of an expected type. Using a test corpus, the tester can assume that the inputs are of a certain form, and it suffices to check that the outputs satisfy expected criteria. Using a test suite, it suffices to compare the outputs of the benchmark to the standard outputs of the test suite to find any differences.

Test suites make sense when the range of acceptable outputs is not too varied (for instance, there is only one correct output for each input). This is not generally the case for translations: there is no one correct translation of a given segment. Individual benchmarks may still satisfy this restriction: e.g. a benchmark checking retrieval of exact matches could input a corpus of n distinct segments and check that these segments are retrieved against the same corpus as the source text.
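The exact-match benchmark mentioned above could look roughly like this. The interface of the system under test is an assumption: we suppose it exposes `add(source, target)` and `lookup(source)`, and the toy `DictMemory` class merely stands in for a real product.

```python
class DictMemory:
    """Toy exact-match memory standing in for the system under test."""

    def __init__(self):
        self._pairs = {}

    def add(self, source, target):
        self._pairs[source] = target

    def lookup(self, source):
        return self._pairs.get(source)

def exact_match_benchmark(memory, corpus):
    """Store n distinct pairs, query each source again, return the fraction retrieved."""
    sources = {source for source, _ in corpus}
    if len(sources) != len(corpus):
        raise ValueError("benchmark corpus must contain distinct source segments")
    for source, target in corpus:
        memory.add(source, target)
    hits = sum(1 for source, target in corpus if memory.lookup(source) == target)
    return hits / len(corpus)

# Synthetic corpus of n = 100 distinct segments.
corpus = [("Segment %d." % i, "Translation %d." % i) for i in range(100)]
score = exact_match_benchmark(DictMemory(), corpus)
```

Since each input has exactly one correct output, the result can be checked mechanically; any score below 1.0 points to segments lost or confused on import or retrieval.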

The next question is how the data and/or tools used in the benchmark are provided. There are various alternatives:

Alternative 1 imposes all the work on the tester, while 4 is easiest for the tester. On the other hand, 4 is the least flexible and leaves the door open for 'tweaking' the system with benchmark-specific optimizations. Alternatives 2-3 seem best in principle: they provide an unlimited supply of data as well as objective control over the quality of the data.

There is a major watershed between benchmarks that are run between the tester and the program and those that involve users as test subjects. In fact benchmarks that involve users threaten to come close to scenario testing. If there is a difference, a benchmark involving users would just quantify limited aspects of user behavior (keystrokes, time, errors) in a limited test situation, trying to control variables extraneous to the particular benchmark. A scenario test would attempt to simulate a normal working environment.

The interpretation part of the benchmark specification should tell a tester what the benchmark can tell about a product. For instance, if it is critical for a site to have access to a large mass of texts simultaneously, run a capacity benchmark. If there is a high turnover of personnel or a deadline to meet, run a learnability benchmark.

Suggestions for translation memory benchmarks

Translation memory: segmentation and retrieval.

Component: translation memory, application.

Related to attribute: accuracy (complete, correct)


The results will depend a lot on the variability within T, but that will not prevent them from serving as a basis for comparison among translation memories. Part of the experiment can be automated; for the other parts only limited human intervention is necessary (providing input for memory filling, hand-checking translations generated from the memory).

The method described above is applicable for evaluating the system's ability to find 100% matches. (Only there is the notion of percentage of match well defined.)

A further refinement is to study the usefulness of matches of less than 100%. This could be done empirically by observing the use translators make of such matches.

Another more principled approach is to use calibrated texts, using system independent methods to determine useful matches in them.

Instead of providing ready-made texts, it would be more useful to provide methods for producing them, i.e. for evaluating the degree of repetition in a given text. This angle is also of some theoretical interest for translation theory. EAGLES should review the literature on this.
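As a first approximation, the degree of repetition in a text can be taken to be the fraction of segments that repeat an earlier segment verbatim. The sketch below assumes this definition and assumes the text has already been segmented; a serious method would also have to account for near repetitions.

```python
from collections import Counter

def repetition_rate(segments):
    """Fraction of segments that repeat an earlier segment verbatim."""
    if not segments:
        return 0.0
    counts = Counter(segments)
    # Every occurrence after the first one counts as a repetition.
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(segments)

text = ["Open the lid.", "Remove the filter.", "Open the lid.", "Open the lid."]
rate = repetition_rate(text)
```

A text scoring near zero on this measure offers little for exact matching and can only benefit from fuzzy matching against other material.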

Translation memory, import from external format.

A similar benchmark is possible for translation memory in update mode. Given some training base T (a set of predefined translations), measure the effort involved in executing the updating for T.

Translation memory, learnability.

Benchmarking ease of learning and using is a lot more complicated. A problem is that most of these require simulated actual use, and so involve a human user.

This makes the measurements less reliable and moreover less efficient to apply.

For example, consider benchmarking the learnability of translation memory use. Typically, such a benchmark will be defined as follows: define some initial state q0 of a human being, e.g. a translator who has no experience at all with translation memories. Define a criterion state qc, which is the state of this same person once he or she has become a competent user of the system. The experimental task is to get from q0 to qc. The measurement is, for example, the time needed and the number of keystrokes used. The problem is that after having done this for system A, our experimental subject is no longer in state q0, and so we need somebody else for the measurement on system B. But people differ in all sorts of relevant respects, such as intelligence, so the measurement is not automatically reliable. Normally one would try to compensate by applying the various conditions to randomly composed groups of people from a fairly homogeneous population; we can propose that, but the idea of efficient benchmarking suffers a lot.

Translation memory, portability

For some other attributes of interest, like portability, benchmarking may come down to comparing systems with a pre-given checklist.

Benchmarks suggested by Steenbakker's featurization

In the following, 'primitive actions' are actions like pressing a key on the computer's keyboard or clicking or dragging with a mouse.

Determining the minimum number of such actions needed for some given task may be a measure of 'user-friendliness'. Of course the possibility to define macros has to be taken into account.

We propose to choose as an experimental benchmark the ability of a translation memory to recover different sorts of near (fuzzy) matches. The test will consist of entering a given text into the translation memory, performing certain systematic changes on the text, and measuring the recall of the system after those changes. The changes will range systematically from identical segments to changes of punctuation, changes in constants (numbers, names), changes in segment length (shorter and longer segments than stored), changes in wording (words substituted, added or left out), changes in sentence structure (word order, grammatical construction), etc. The tests will be designed so that they give an indication of what kind of fuzzy matching algorithm the system is using.

The test material can be a test suite (a given text with a set of variants), or, even better, a program that will make various types of change to any given text. This would appear to be one of the most crucial types of benchmark one can run on a translation memory, one that gets to the heart of this type of program.
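A program of the kind suggested here might be organized as below. This is a sketch under several assumptions: the change operators are deliberately crude, the system under test is represented by a hypothetical `lookup(stored, query)` function, and `toy_lookup` (based on `difflib.SequenceMatcher` with an arbitrary 0.8 threshold) merely stands in for a real translation memory.

```python
from difflib import SequenceMatcher

def identical(seg):
    return seg

def change_punctuation(seg):
    return seg.rstrip(".!?") + "!"

def change_numbers(seg):
    return "".join("7" if ch.isdigit() else ch for ch in seg)

def drop_last_word(seg):
    words = seg.rstrip(".!?").split()
    return " ".join(words[:-1]) + "."

def swap_first_two_words(seg):
    words = seg.split()
    if len(words) > 1:
        words[0], words[1] = words[1], words[0]
    return " ".join(words)

# One operator per type of change; a real test would have several of each.
CHANGES = {
    "identical": identical,
    "punctuation": change_punctuation,
    "constants": change_numbers,
    "length": drop_last_word,
    "word order": swap_first_two_words,
}

def toy_lookup(stored, query, threshold=0.8):
    """Stand-in for the system under test: best stored segment above a threshold."""
    def score(candidate):
        return SequenceMatcher(None, candidate, query).ratio()
    best = max(stored, key=score)
    return best if score(best) >= threshold else None

def recall_by_change(stored, lookup):
    """For each type of change, the fraction of segments still recovered."""
    recall = {}
    for name, change in CHANGES.items():
        found = sum(1 for seg in stored if lookup(stored, change(seg)) == seg)
        recall[name] = found / len(stored)
    return recall

stored = [
    "Set the timer to 30 minutes.",
    "Replace the filter every 6 months.",
    "Switch off the power before cleaning.",
]
recall = recall_by_change(stored, toy_lookup)
```

The profile of recall across the types of change, rather than any single figure, is what hints at the matching algorithm: a character-based matcher, for instance, tolerates punctuation changes better than reordering.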
