Any feature checklist used for evaluation needs to be standardized: it should be applicable to any such tool, and its results should be independent of situational variables. Moreover, the relation between the availability of a given feature and the quality aspect it indicates has to be defined carefully.
Therefore, a featurization of translation memories should explicitly state the purpose(s) of each feature (component, function, or attribute) of the translation memory and explain how each feature serves that purpose. It should start from a description of the uses of a translation memory in translation.
To fix terms, we propose the following definition of a translation memory:
a translation memory is a multilingual text archive containing (segmented, aligned, parsed and classified) multilingual texts, allowing storage and retrieval of aligned multilingual text segments against various search conditions.
Different translation memories differ as to the information stored along with the raw texts and the retrieval methods. This definition does not restrict translation memory to what is currently available in systems on the market.
The description below follows the division of functions in Steenbakkers' and des Tombe's featurization into off-line (analysis, import, and export) functions and online (in-translation) functions. These can be compared to a division of database functions into database management and database use (query/updating) functions in general. The point of this document is to bring out the theoretical questions, motivations, and options vis-à-vis each feature. Fuller featurizations are relegated to Appendix Feature Checklist Examples.
A translation memory is a collection of multilingual correspondences with optional control information stored with each correspondence. This characterization abstracts away from the actual manner of storing the correspondences (one-one, one-many, or many-many).
The control information can include information about the source text of the correspondence, its date, author, company, subject domain. This information may be used in ranking matches.
When a translation memory is used to support a given direction of translation, we can identify one segment of each correspondence as the (stored) source segment and another one as the (stored) target segment. A given query with a current source segment may return a number of correspondences with matching stored source segments.
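The characterization above can be sketched as a data structure. This is a minimal illustration, not any product's storage format; the names `Correspondence` and `lookup` are hypothetical. It shows how identifying a source and a target segment is a matter of how the memory is used, not of how correspondences are stored:

```python
# Hypothetical sketch: a translation memory as a collection of
# correspondences, each pairing segments by language, with optional
# control information (source document, date, author, domain, ...).
from dataclasses import dataclass, field

@dataclass
class Correspondence:
    segments: dict                                # language code -> segment text
    control: dict = field(default_factory=dict)   # doc, date, author, ...

def lookup(memory, query, src_lang, tgt_lang):
    """Return (stored source, stored target, control) triples whose stored
    source segment exactly matches the current source segment. The same
    memory serves any direction of translation."""
    hits = []
    for c in memory:
        if c.segments.get(src_lang) == query and tgt_lang in c.segments:
            hits.append((c.segments[src_lang], c.segments[tgt_lang], c.control))
    return hits

tm = [
    Correspondence({"en": "Press the red button.", "de": "Drücken Sie den roten Knopf."},
                   {"doc": "manual-v1", "date": "1995-01-01"}),
    Correspondence({"en": "Close the lid.", "de": "Schließen Sie den Deckel."}),
]
```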
Import transfers a text and its translation from a text file into the translation memory.
Raw format is any format in which an external source text and its translation may be available for importing into a translation memory (ascii, word processor format, unsegmented, unaligned). Import from raw format may require preprocessing of the texts by the user outside the system and/or interactive editing of the text inside the system. The system may also be primed to accept texts in given external mark-up formats.
Native format is a format used by the translation memory program to save translation memory in a file. Native format may retain segmentation, alignment and control information.
Analysis may mean processing of a multilingual text before importing it into a translation memory, or processing of a monolingual source text before submitting it to translation; in each case the input-output relationship must be defined. Analysis involves parsing of the source and target texts to some depth.
The purpose of segmentation is to choose the most useful translation units.
Segmentation involves a type of parsing. For simplicity, parsing tasks are usually arranged in increasing order of complexity, with little feedback from higher levels of analysis to lower ones. Segmentation is done monolingually using superficial parsing (punctuation, mark-up), and alignment is in turn based on segmentation.
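The superficial, punctuation-based parsing mentioned above can be illustrated with a toy segmenter. This is a minimal sketch of the punctuation heuristic only; real systems also consult mark-up and abbreviation lists:

```python
import re

def segment(text):
    """Naive monolingual segmenter: split at sentence-final punctuation
    followed by whitespace. Illustrates superficial parsing; it will
    mis-segment abbreviations like 'e.g.' that real systems handle."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]
```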
It has been found that if translators manually correct segmentations made by the program, later versions of the same document will fail to find matches against the translation memory built on the corrected segmentation, because the program will repeat its own segmentation (errors).
If we want to benchmark segmentation, we need standard, segmented corpora or a standard segmenter. But is there such a thing as 'correct segmentation'? What makes a segmentation correct? That it produces natural translation units: units that are of the right size for translators to work on, and which allow useful translation correspondences between source and target languages.
The units must be small enough that their degree of repetition is sufficient to ensure a significant hit rate (the smaller the unit, the greater the probability of repetition). But they must not be so small that their alternative translations vary too much (there must be enough context to fix the translation), or that they are proper parts of more useful translation correspondences.
The optimum size of translation unit may vary with text type and translation task. At one extreme, translation of a new version of a previously translated text may allow a large percentage of perfect or near-perfect matches (only constants such as names and numbers may have been changed). At the other extreme, different texts written in the same style, genre, or phraseology may have few perfect matches but a high degree of repetition in the terminology and phraseology used. As phraseology in particular has no fixed string identity, recognition of such repetition requires deeper linguistic analysis.
An ultimate solution to the segmentation and alignment problems comes close to the problem of statistical machine translation: the task is to find alignments which give the best predictions for further translation correspondences.
In practice, translators proceed 97% of the time sentence by sentence, although the translation of one sentence may depend on the translation of the surrounding sentences.
Alignment is the task of defining translation correspondences between source and target texts. Segmentation serves alignment which in turn serves the aim to increase the usefulness of the translation memory proposals for translation.
In principle, there should be feedback from alignment to segmentation: if the translation uses different punctuation from the original, alignment may fail. A good alignment algorithm should be able to correct the initial segmentation (put another way, alignment should consider alignments other than 1-1 between the initial segmentations).
A further service that can be provided as an analysis stage is automatic (known or unknown) vocabulary or term extraction. Such extraction can have as input a previous dictionary or dictionaries. In addition, especially in the case of extracting unknown terms, it can use parsing and heuristics based on text statistics.
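A crude statistics-based extractor of unknown terms can be sketched as follows; the candidate definition (recurring stopword-free word bigrams) is an illustrative assumption, and real extractors add parsing and comparison against existing dictionaries:

```python
from collections import Counter
import re

# Tiny illustrative stopword list; a real one would be language-specific
# and much larger.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in"}

def extract_terms(text, min_freq=2):
    """Crude unknown-term extractor: candidate terms are word bigrams
    that contain no stopword and recur at least min_freq times."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    bigrams = Counter(zip(words, words[1:]))
    return sorted(" ".join(b) for b, n in bigrams.items()
                  if n >= min_freq and not STOPWORDS & set(b))
```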
This raises a question of principle: what is the difference and division of labor between translation memory and term bank? Length of translation unit? Manual/automatic insertion of correspondences? Type, quality and amount of collateral information?
Users of TM/2 report that they use the term bank to store useful correspondences between segments not recognized by the TM/2 segmenter, not only well defined terminology and phraseology but other types of repetitive material as well.
Statistics is used to estimate the amount of work involved in a translation job. This is needed for planning and scheduling the work and for billing. Typical jobs for translation statistics are word counting and estimating amount of repetition in the text. Both tasks depend on the choice of unit of counting.
What constitutes repetition? The percentage of repetition depends on the size of the unit. Characters are trivially repeated; how large a percentage of individual words, sentences, or paragraphs is repeated? What constitutes sameness (character-by-character identity, identity of the base form of a word)? What repetition is useful for translation?
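The dependence of the repetition percentage on the unit of counting can be made concrete with a small sketch (here sameness is plain string identity after lowercasing; base-form identity would require morphological analysis):

```python
import re

def repetition_rate(text, unit="word"):
    """Share of tokens that repeat an earlier token, for a chosen
    counting unit. The same text yields very different figures for
    characters, words, and sentences."""
    if unit == "sentence":
        tokens = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    elif unit == "word":
        tokens = re.findall(r"\w+", text.lower())
    else:                      # character
        tokens = list(text)
    if not tokens:
        return 0.0
    return 1 - len(set(tokens)) / len(tokens)
```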
Export involves transfer of text from the translation memory into an external text file. Import and export should be inverses.
What control information should be saved with a translation memory?
Should post-editing do search/replace in the aligned bilingual text, i.e., have access to the aligned text, and perhaps even to the original translation memory response (since mistakes may actually be mis-translations rather than monolingual mistakes in the target text)?
Think of translation memories as databases, and merging as database table join. A row in the table consists of a segment, its translation and control information. A join is based on identity or fuzzy match in one or more of the fields.
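The join view of merging can be sketched as follows, with rows as dicts and an exact-match join on the source field; the conflict policy (rows from the first memory win) is an illustrative assumption, not any product's merge semantics:

```python
def merge(tm_a, tm_b, key="source"):
    """Merge two translation memories represented as lists of rows
    (dicts with 'source', 'target' and control fields), keeping one row
    per distinct key value. Rows from tm_a win on conflict, because they
    are written into the dict last."""
    merged = {}
    for row in tm_b + tm_a:
        merged[row[key]] = row
    return list(merged.values())
```

A fuzzy-match join, as the text notes, would replace the exact dict lookup with a similarity threshold on one or more fields.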
Merging happens in some systems in conjunction with analysis, where one of the inputs to merging is an input text. This is called filtering here. TM/2 has an analysis option 'Create a file of untranslated segments' intended for input to an MT system. This is a type of filtering too.
Inversion means reversing the direction of translation (exchange of source and target languages of the TM).
Composition means deriving a TM for the language pair A-C from TMs for the language pairs A-B and B-C.
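Both operations can be sketched on a memory represented as (source, target) pairs. The composition here joins on exact identity of the shared B segment, which is a simplifying assumption; real composition would need fuzzy matching and disambiguation when a B segment has several C translations:

```python
def invert(tm):
    """Reverse the direction of translation: swap source and target."""
    return [(t, s) for s, t in tm]

def compose(tm_ab, tm_bc):
    """Derive an A-C memory from A-B and B-C memories by joining on the
    shared B segment (keeping the first C translation seen for each B)."""
    b_to_c = {}
    for b, c in tm_bc:
        b_to_c.setdefault(b, c)
    return [(a, b_to_c[b]) for a, b in tm_ab if b in b_to_c]
```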
During translation, the main purpose of the translation memory is to retrieve the most useful correspondences matching the current source segment in the memory for the translator to choose from.
Desiderata on the online functioning of a translation memory include that the translation memory must
For instance, in TM/2, preference is given to a match in the same document. This increases consistency within the text (e.g. several translators working with the same text will be given the same first choice).
Think of translation memories as databases, and retrieval as a database query. We have a partially described translation correspondence at hand and wish to retrieve from the translation memory one or more matching translation correspondences.
This symmetric way of looking at retrieval immediately suggests refinements of the usual query method. There could be many things the translators could restrict in the search condition besides the current source segment, in particular, properties of the target segment and control information.
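Such a symmetric query interface can be sketched as follows; the function and its parameters are hypothetical, chosen to show restrictions on the target segment and on control fields alongside the usual source-side condition:

```python
def retrieve(memory, source_contains=None, target_contains=None, **control):
    """Symmetric query over a memory of (source, target, control) triples:
    the translator may restrict not only the source segment but also the
    target segment and any control field (document, author, domain, ...)."""
    hits = []
    for src, tgt, ctrl in memory:
        if source_contains and source_contains not in src:
            continue
        if target_contains and target_contains not in tgt:
            continue
        if any(ctrl.get(k) != v for k, v in control.items()):
            continue
        hits.append((src, tgt, ctrl))
    return hits
```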
An exact match is a perfect character by character match between current source segment and stored source segment.
Everything else is a fuzzy match. Some systems assign percentages to fuzzy matches. Such figures are not comparable across systems unless the method of scoring is specified. The score may depend on the depth of analysis done to the source segment. A 90% match could mean that 90% of the stored source segment is identical with the current source segment counting in terms of character strings, word forms, content words, or yet another unit meaningful for translation.
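The point that a percentage score depends on the unit of comparison can be demonstrated with a small sketch using Python's `difflib.SequenceMatcher` as an arbitrary similarity measure (the choice of measure is itself an assumption; systems differ here too):

```python
import difflib
import re

def fuzzy_score(current, stored, unit="char"):
    """Similarity between the current and a stored source segment as a
    percentage. The same pair of segments gets different scores depending
    on whether we compare character strings or word forms."""
    if unit == "word":
        a = re.findall(r"\w+", current.lower())
        b = re.findall(r"\w+", stored.lower())
    else:                      # character strings
        a, b = current, stored
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())
```

On `"press the red button"` against `"press the green button"`, the character-level score is noticeably higher than the word-level score, so quoting either figure without the unit is uninformative.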
Here, it is useful again to compare translation memories with databases. Translation memory is updated with a new translation correspondence when a translation has been accepted by the translator. As always in updating a database, there is the question what to do with the previous contents of the database.
More generally, we can ask whether a translation memory can be modified interactively during translation by adding, deleting or changing entries in the translation memory.
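The update question can be made concrete with two illustrative policies for a source-keyed memory; the names and the dict representation are assumptions for the sketch:

```python
def update(memory, source, target, policy="replace"):
    """Update a source -> [targets] memory (dict) with an accepted
    translation. Two policies for the previous contents: 'replace'
    overwrites the old translation(s); 'keep' stores alternatives."""
    if policy == "replace":
        memory[source] = [target]
    else:
        memory.setdefault(source, [])
        if target not in memory[source]:
            memory[source].append(target)
    return memory
```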
Does the system allow one or several translation memories to be open simultaneously? If it does, can entries be transferred between translation memories? (Why should we want to do that?)
A multilingual text archive lets the translator specify queries against the archive and returns responses for the translator to process as desired. A translation memory proper can do retrieval and even substitution automatically, without help from the translator.
An integrated translation memory in a translator's workbench features automatic retrieval and evaluation of translation correspondences.
Exact matches come up when translating new versions of a document. If exact matches are substituted automatically, the translator never checks the translation against the original, and any mistakes in the original translation will carry over.
Networking during translation makes possible efficient translation of a text in parallel by a team of translators. Translations and term entries entered by one translator are immediately made available to others.
On the other hand, if terminologies or translation memories are shared before the translations are final, any mistakes made by one translator are broadcast as easily as correct translations. (Then again, consistently incorrect translations may be easier to fix in post-editing.)