Comments: Evaluation requirements subsume the specification of user needs, along with other considerations preliminary to evaluation. "User" here should not be taken to necessarily mean the end user of a system or of the translation produced by a system. The user is simply whoever uses the object being evaluated, or even the results of the evaluation, and may thus, depending on the kind of evaluation being done, be a research worker, a developer, a vendor, a manager and so on.
Definition:
The purpose of evaluation is to provide a basis for making decisions.
References for this characteristic:
Definition:
Internal evaluation occurs on a continual or periodic basis in the course of research and development. Internal evaluations test whether, for example, the components of an experimental prototype or pre-release system work as they are intended.
This type of evaluation mainly concerns functionality and needs to show coverage of the fundamental contrastive phenomena of the language pair, just like feasibility evaluation. However, at this point in a system's life cycle, it must also be shown that the system is actually improving as a result of development (changeability), and that improvement in one area does not make something else worse (stability). (In terms of EAGLES 1996, this is a progress evaluation).
References for this characteristic:
Reeder dissertation (forthcoming)
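The two conditions named in the definition above, improvement (changeability) and absence of regression (stability), can be illustrated with a minimal sketch. It assumes that each system version has been scored per linguistic phenomenon on the same test suite; the phenomenon names, the scores and the function name progress_report are hypothetical and are not part of the taxonomy.

```python
def progress_report(previous, current, tolerance=0.0):
    """Compare per-phenomenon scores of two system versions.

    previous, current: dicts mapping a phenomenon name to a score
    (higher is better). Only phenomena scored in both versions are compared.
    """
    shared = [name for name in previous if name in current]
    improved = sorted(n for n in shared if current[n] > previous[n] + tolerance)
    regressed = sorted(n for n in shared if current[n] < previous[n] - tolerance)
    return {
        "improved": improved,    # evidence that development is helping
        "regressed": regressed,  # evidence that a change made something else worse
        "is_progress": bool(improved) and not regressed,
    }


# Hypothetical scores for two successive versions of a prototype.
version_1 = {"negation": 0.70, "relative clauses": 0.60, "tense/aspect": 0.75}
version_2 = {"negation": 0.80, "relative clauses": 0.60, "tense/aspect": 0.65}

report = progress_report(version_1, version_2)
print(report)
```

In this invented example the second version improves on one phenomenon but regresses on another, so it would not count as clear progress in the sense described above.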
Comments: Diagnostic evaluation is arguably a subset of the internal evaluations; however, segregating the diagnostic needs at the same level as internal does improve the distinction between iterative progress tests and tests to find a specific problem.
Definition: The purpose of diagnostic evaluation is to discover why a system did not give the results it was expected to give. Typically performed by a researcher developing a prototype system, such an evaluation is almost exclusively concerned with functionality characteristics and will also often make use of internal metrics based on the intermediate results the system produces. Diagnostic evaluation typically uses glass-box evaluation principles.
References for this characteristic: EAGLES 1996
Comments: Internal evaluation may also be glass box, and it is also possible to use black-box evaluations to do diagnostics.
The property "glass/black box" does not spawn children, but is rather a property which distinguishes certain methods under more than one taxon.
Definition:
The purpose of declarative evaluation is to measure the ability of an MT system to handle texts representative of those an actual end-user works with.
It is concerned with coverage of linguistic phenomena and handling of samples of real text. Declarative evaluations generally test for the functionality attributes of intelligibility (how fluent or understandable the output appears to be) and fidelity (the accuracy and completeness of the information conveyed).
References for this characteristic: White 2000
Definition:
Operational evaluations generally address the question of whether an MT system will actually serve its purpose in the context of its operational use. The primary factors include the cost-benefit of bringing the system into the overall process (costs).
References for this characteristic: White 2000
Comments:
A variety of issues are considered here, including such things as software and hardware compatibility with the incumbent office automation system (interoperability). However, the more fundamental question to ask for operational use is whether the MT system enhances the effectiveness of the downstream task, or whether the end-to-end process is better off without it.
As an example, consider cross-lingual information retrieval. Evaluation of MT embedded into a cross-lingual information processing environment takes into account the measures that are germane to the downstream task. So if we want to know whether an MT system helps information extraction, we compare the recall and precision (metrics germane to extraction) of the MT plus extraction configuration to an expert translation plus extraction process, or to an extraction without any translation at all. Note that we do not measure functionality characteristics of the MT system itself, such as fidelity and intelligibility, but rather the effect of the MT (good or bad) on the downstream task in terms of that task's metrics. To a large extent, then, operational evaluation lies outside the bounds of this classification, which is concerned only with the classification and evaluation of MT systems.
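The comparison described in this example can be sketched as follows. This is only an illustration under stated assumptions: the gold-standard items, the items extracted under each configuration and the function name precision_recall are all hypothetical, and a real evaluation would use the extraction task's own scoring procedure.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted items against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard facts for one document.
gold = {"event:merger", "org:Acme", "date:2001-05-03", "place:Geneva"}

# Hypothetical extraction results under the three configurations being compared.
configurations = {
    "MT + extraction": {"event:merger", "org:Acme", "place:Genf"},
    "expert translation + extraction": {
        "event:merger", "org:Acme", "date:2001-05-03", "place:Geneva",
    },
    "extraction without translation": {"org:Acme"},
}

scores = {name: precision_recall(items, gold) for name, items in configurations.items()}
for name, (precision, recall) in scores.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

The point of the sketch is that the MT system is judged only through the downstream task's metrics: no fidelity or intelligibility score for the translation itself appears anywhere in the comparison.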
Definition: The purpose of usability evaluation is to measure the ability of a system to be useful to people who are actually going to use it. ISO 9126 talks of "quality in use" characteristics, which are the combination of other characteristics that enable a user to achieve specified goals with effectiveness, productivity, safety and satisfaction in a specified context of use.
References for this characteristic: White (2000)
Comments: Usability evaluation is a domain in its own right, which involves kinds of testing such as scenario and laboratory testing which are common to many kinds of software product. It is often undertaken by the manufacturers of products before the product is launched on the market. It falls outside the scope of the current classification. However, much information about usability evaluation can be found by consulting the European Usability Support Centres home page.
Definition: According to White (2000) a feasibility study is an evaluation of the possibility that a particular approach has any potential for success after further research and implementation. Feasibility evaluations provide results of interest to researchers and to sponsors of research. The characteristics that a feasibility evaluation typically tests for are functionality attributes such as the coverage of sub-problems particular to a specific language pair and the possibility of extending to more general phenomena (changeability).
Definition:
Requirements discovery is often an iterative process in which developers create prototypes in order to elicit reactions from potential stakeholders. In so-called "rapid prototyping" approaches to requirements discovery, developers create prototypes designed to demonstrate specific aspects of functional capabilities that might ultimately be implemented. Scenario-based observational studies are often used to assess the utility of the functions demonstrated by the prototype.
References for this characteristic:
Definition: Characteristics of the translation task refers to the information flow intended for the output, from the point of view of the agent (human or otherwise) who receives the translation.
References for this characteristic:
Comments:
As was noted by J.C.Sager for Machine Translation systems, "two types of use [are] to be considered: (a) the un-edited output; (b) the edited output. The output may be acceptable for either use or both and the evaluation should determine this. In the case of edited output the cost of revision, editing etc. has to be established and compared with the cost of manual translation. Since the type of use is related to the type of text, these types have to be established and taken into account."
In Toward Finely Differentiated Evaluation Metrics for Machine Translation, Hovy suggests dividing all the possible translation tasks into three main groups. He noted that "in order to make the taxonomization of features useful to people who do not already know much about MT and do not wish to become experts in evaluation, it is important to articulate its layers and choices in terms they can intuitively understand." This part of the present evaluation taxonomy describes three principal types of use in such a way that users can identify the particular type of work they want to have done, while developers can define in strict terms what their MT system can do.
Definition: The ultimate purpose of the assimilation task (of which translation forms a part) is to monitor a (relatively) large volume of texts produced by people outside the organization, in (usually) several languages.
References for this characteristic: Hovy 1999
Comments: This requirement is also related to the "Domain" requirement: it must be clear to which domains the system can be applied.
Definition: The purpose of document routing / sorting is to scan incoming translated documents quickly in order to send them to the appropriate points for further processing or storage.
References for this characteristic:
Definition: The purpose of information extraction or summarization is to extract some portion(s) of the translated text, either manually or automatically, for subsequent processing or storage. Information extraction is typically concerned with filling templates by identifying atomic elements of events. In contrast, summarization aims to provide a self-contained and internally cohesive text which serves as a selective account of the original.
References for this characteristic:
Definition: The goal of a search process is to identify a set of documents that together can satisfy an information need. Subtasks include refinement of the searcher's understanding of their need, refinement of the expression of that need as a query, and recognition of relevant documents. Automated components of search systems typically accomplish only portions of the required task, leaving the searcher to assess factors (e.g., veracity and completeness) that would be difficult to detect by automated means. Searchers with limited proficiency in the languages in which the documents are written will require translation support to accomplish information need refinement, query reformulation, and relevant document recognition.
References for this characteristic:
Definition: The ultimate purpose of dissemination is to deliver to others a translation of documents produced inside the organization.
References for this characteristic: Hovy 1999
Comments: This requirement is also related to the Domain requirement (1.5.1.2/137): it must be clear to which domains the system can be applied.
Definition: In the case of internal / in-house dissemination the translations are sent to other people in the same organization, who share aspects of the culture, terminology, and domain knowledge to some extent.
The most important feature for this type of task is speed: how fast the system is, and whether it can keep up with the volume of input.
Definition: The recipients of translation perform a relatively routine task that does not require much variability in the translation service.
Definition: The recipients of translation perform a rather variable task, and hence may request translations in new domains, genres, or extensions.
Comments: The following feature is especially important for this task:
Adaptability - how easily can the system's output be changed in response to requests from the recipients (addition of new words, use of different phrases and expressions, etc.)
Definition: In the case of external dissemination / export / publication the translations are sent to other people in other organizations, who may not share aspects of the culture, terminology, and domain knowledge.
Definition: The recipients of the translation all have essentially the same needs; their translations do not require specific tailoring.
Definition: Since the recipients of the translation have different needs and capabilities, translation has to be tailored to them.
Definition: The ultimate purpose of the communication task is to support multi-turn dialogues between people who speak different languages. The translation quality must be high enough for painless conversation, despite possible syntactically ill-formed input and idiosyncratic word and format usage.
References for this characteristic: Hovy 1999
Definition: In the case of synchronous or interactive communication, the interaction between the participants occurs in real time.
Definition: In the case of asynchronous or delayed communication the interaction between participants occurs with interruption, for example by email.
Definition: Input characteristics refer to the stylistic form or format of the source document, the topic domain, and both the competency and performance qualities of the author.
Definition: The type of the input document can greatly affect the output of an MT system. For example, inputs to the METEO system are specific and very restricted, mainly weather forecast texts, using a limited lexicon and particular syntactic constructions. As a result the system produces accurate output, comparable to human translation. In contrast, MT of arbitrary text invariably produces output of much lower quality. Both the genre and the application domain determine the quality.
Definition:
Genre refers to the characteristic or definitive form and style peculiar to a type of document.
Examples of genre are: newspaper articles; scientific and technical articles; recipes and instructions; correspondence; business/commercial reports; marketing texts and advertisements; legal texts; literature: novels, poetry, etc.; and many others.
Definition: Domain refers to topic, the field of interest for which the document is relevant, and the potential sublanguage effects germane to MT, for example technical/scientific (specific field being biology, chemistry, automotive mechanics, etc.), social, etc.
Definition: This set of characteristics covers writer attributes that are relevant to the writing task, which influence the unproofed text that is produced.
Comments: We cannot assume that the writer has exactly the goals of the end-user (or reviser) in terms of the proofed text. For example, a text originating with an American writer intended for a USA audience has to be revised when British English is the target. Thus, the definition of relevant idiosyncratic intentional error sources is dependent on what the actual final readership is; there may be many other idiosyncratic error sources that are not relevant to the actual task target (EAGLES-96).
Definition:
This refers to proficiency in the source language as attested by some recognised measurement, international or regional.
Two of the best known language proficiency scales are the ACTFL guidelines (first proposed in 1983 by the American Council for the Teaching of Foreign Languages) and the ILR (FSI) proficiency scale, a five-level scale originally developed by the Foreign Service Institute (FSI) of the United States government, and later adopted by other services under the name of Interagency Language Roundtable (ILR) scale. The scale proposed for use in FEMTI is based on the ACTFL guidelines.
References for this characteristic: ACTFL 1983, ACTFL 1999, ACTFL 2001, Orwig 1998, Clark and Clifford 1988.
Comments:
Depending on the type of input to the MT system (spoken or written), the guidelines for speaking vs. writing should be used.
This characteristic might also influence the "source of errors" (below).
Definition:
" Novice-level writers are characterized by the ability to: produce lists and notes and limited formulaic information on simple forms and documents; recombine practiced material supplying isolated words or phrases to convey simple messages, transcribe familiar words or phrases, copy letters of the alphabet or syllables of a syllabary, or reproduce basic characters with some accuracy; communicate basic information. " (ACTFL 2001).
Comments:
Could be further refined into Novice-Low, Novice-Mid and Novice-High.
The Novice-Low and Novice-Mid levels correspond to the 0 level of the ILR scale. The Novice-High level corresponds to the 0+ level of the ILR scale. (Orwig 1998).
Definition:
" Intermediate-level writers are characterized by the ability to: meet practical writing needs -- e.g., simple messages and letters, requests for information, notes -- and ask and respond to questions; create with the language and communicate simple facts and ideas in a loosely connected series of sentences on topics of personal interest and social needs, primarily in the present; express meaning through vocabulary and basic structures that is comprehensible to those accustomed to the writing of nonnatives. " ACTFL 2001.
Comments:
Could be further refined into Intermediate-Low, Intermediate-Mid and Intermediate-High.
The Intermediate-Low and Intermediate-Mid levels correspond to level 1 of the ILR scale. The Intermediate-High level corresponds to level 1+ of the ILR scale. (Orwig 1998).
Definition:
" Advanced-level writers are characterized by the ability to: write routine informal and some formal correspondence, narratives, descriptions, and summaries of a factual nature; narrate and describe in major time frames, using paraphrase and elaboration to provide clarity, in connected discourse of paragraph length; express meaning that is comprehensible to those unaccustomed to the writing of non-natives, primarily through generic vocabulary, with good control of the most frequently used structures. " ACTFL 2001.
Comments:
Could be further refined into Advanced-Low, Advanced-Mid and Advanced-High.
The Advanced-Low and Advanced-Mid levels correspond to level 2 of the ILR scale. The Advanced-High level corresponds to level 2+ of the ILR scale. (Orwig 1998).
Definition:
" Superior-level writers are characterized by the ability to: express themselves effectively in most informal and formal writing on practical, social, and professional topics treated both abstractly as well as concretely; present well developed ideas, opinions, arguments, and hypotheses through extended discourse; control structures, both general and specialized/professional vocabulary, spelling or symbol production, punctuation, diacritical marks, cohesive devices, and other aspects of written form and organization with no pattern of error to distract the reader " ACTFL 2001.
Comments:
The Superior level is not refined in the ACTFL guidelines.
The Superior level corresponds to levels 3 and 3+ of the ILR scale. The ILR scale also includes levels 4 and 5, characterized in earlier versions of the ACTFL guidelines respectively as Distinguished and Native (Orwig 1998).
A person at ILR level 4 (or S-4) is " able to use the language fluently and accurately on all levels normally pertinent to professional needs" (Orwig 1998).
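The correspondences between the ACTFL writing levels and the ILR scale given in the comments above can be collected into a small lookup table. The sketch below is only a convenience illustration; the names ACTFL_WRITING_TO_ILR, ilr_levels and actfl_levels are hypothetical, and the table contains nothing beyond the correspondences stated above (after Orwig 1998).

```python
# ACTFL writing level -> corresponding ILR level(s), as listed in the comments above.
ACTFL_WRITING_TO_ILR = {
    "Novice-Low": ("0",),
    "Novice-Mid": ("0",),
    "Novice-High": ("0+",),
    "Intermediate-Low": ("1",),
    "Intermediate-Mid": ("1",),
    "Intermediate-High": ("1+",),
    "Advanced-Low": ("2",),
    "Advanced-Mid": ("2",),
    "Advanced-High": ("2+",),
    "Superior": ("3", "3+"),
}


def ilr_levels(actfl_level):
    """Return the ILR level(s) corresponding to an ACTFL writing level."""
    return ACTFL_WRITING_TO_ILR[actfl_level]


def actfl_levels(ilr_level):
    """Return the ACTFL writing levels that correspond to a given ILR level."""
    return [actfl for actfl, ilrs in ACTFL_WRITING_TO_ILR.items() if ilr_level in ilrs]


print(ilr_levels("Advanced-High"))
print(actfl_levels("1"))
```

ILR levels 4 and 5 return an empty list here, since the current ACTFL writing guidelines stop at the Superior level.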
Definition: This relates to the experience of the author in producing a particular type of text, for example whether the author is familiar with the terminology and how long he or she has been working in similar posts.
Definition: This set of characteristics covers the errors that are likely to be in the unproofed text. Errors are defined as the difference between the unproofed text and the subsequent proofed text.
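Since errors are defined here as the difference between the unproofed and the proofed text, they can in principle be listed by aligning the two versions. The following is a minimal sketch using word-level alignment from Python's standard difflib module; the function name proofing_errors and the sample sentences are hypothetical, and whitespace tokenisation is a simplifying assumption.

```python
import difflib


def proofing_errors(unproofed, proofed):
    """List the word-level differences between an unproofed and a proofed text.

    Each difference is a tuple (operation, original words, corrected words),
    where operation is 'replace', 'delete' or 'insert'.
    """
    before, after = unproofed.split(), proofed.split()
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only the spans where the two texts differ
            errors.append((tag, " ".join(before[i1:i2]), " ".join(after[j1:j2])))
    return errors


# Hypothetical unproofed sentence and its proofed counterpart.
errors = proofing_errors(
    "The the system depend of the context",
    "The system depends on the context",
)
for operation, original, corrected in errors:
    print(operation, repr(original), "->", repr(corrected))
```

Such an alignment only locates the differences; assigning each one to a source of error (competence, performance, input device and so on) still requires the writer model discussed in the following characteristics.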
Definition:
Errors in this category include dialect differences between the writer's language and some standard language, second-language errors such as wrong prepositions in prepositional phrases, and genuine misconceptions.
References for this characteristic: EAGLES Report - "Evaluation of Writer's aids"
Definition: Considering sources of writer errors during the writing process, this characteristic includes errors from speech recognition, from OCR, cut/copy and paste slips, etc.
Definition:
Depending on the writer model, this type of error includes concentration lapses resulting in "derailed" sentences (for example slips through tiredness), planning fault errors (for example failed agreement between noun phrase determiner and head) and other performance errors.
References for this characteristic: EAGLES Report - "Evaluation of Writer's aids"
Definition:
This covers the characteristics of users in three senses: the end user who will interact with the machine translation system; the end user of the final product of the translation process which may include for example, post-editing; the organisation deploying the machine translation system.
Note however that in the case when machine translation is combined with substantial post-editing, the resulting "system" might no longer fall under the scope of FEMTI, hence the end users are no longer users of a machine translation system.
Comments:
Users here are human users. When an MT system is a component of a larger system, for example an information retrieval system, other pieces of software may be considered to be users of MT output. Which parts of the quality model will apply in the case of a software user is inherently linked to the overall application in question. Software as user is not explicitly considered in the current version of this taxonomy.
When the object of evaluation is an MT system considered as a component of a larger system, evaluation critically depends on the larger application, and the metrics will be those of the "upstream" and "downstream" processes. This aspect of MT evaluation is not dealt with in any detail in the current version of FEMTI.
Definition: This refers to the person who interacts with the machine translation system and with the output produced by it.
Comments:
It is also possible, though not always very clear, to distinguish the translation "producer" (who interacts directly with the MT system or with its raw output), often a translator or a post-editor, from the person or organisation to whom the translation product is delivered. However, if the translation product is substantially edited, then the latter type of user can no longer be considered as a user of the MT system.
Subsequent use of the translation is intimately related to characteristics of the translation task (see 1.2 Characteristics of the translation task).
Definition: This refers to the formal linguistic education of the user as indicated by his or her local education system.
Definition:
This refers to proficiency in the source language as attested by some recognised measurement.
The level of proficiency may be measured, for example, by local education tests, internationally recognised examination schemes or organisation-internal testing.
Two of the best known language proficiency scales are the ACTFL guidelines (first proposed in 1983 by the American Council for the Teaching of Foreign Languages) and the ILR (FSI) proficiency scale, a five-level scale originally developed by the Foreign Service Institute (FSI) of the United States government, and later adopted by other services under the name of Interagency Language Roundtable (ILR) scale.
The scale proposed for use in FEMTI is based on the ACTFL guidelines for reading (1985) -- note that only the guidelines for writing/speaking have been recently updated.
References for this characteristic: ACTFL 1983, ACTFL 1999, ACTFL 2001, Orwig 1998, Clark and Clifford 1988.
Comments:
In the case of generic translation consumers, it is often not feasible to have more than an informal notion of the degree of their proficiency in the source language.
The degree to which this characteristic is pertinent in a specific evaluation will depend on what the translation consumers will do with the translation delivered to them, for example, whether they will in some way repair or polish it.
Depending on the type of output of the MT system (spoken or written), the guidelines for reading vs. listening should be used.
Definition:
" The reader can identify an increasing number of highly contextualized words and/or phrases including cognates and borrowed words, where appropriate. [...] [May have] sufficient control of the writing system to interpret written language in areas of practical need. Where vocabulary has been learned, can read for instructional and directional purposes, standardized messages, phrases, or expressions, such as some items on menus, schedules, timetables, maps, and signs. " (ACTFL 1983 guidelines for reading proficiency).
Comments:
Can be further refined into Novice-Low, Novice-Mid and Novice-High.
The Novice-Low and Novice-Mid levels correspond to the 0 level of the ILR scale. The Novice-High level corresponds to the 0+ level of the ILR scale. (Orwig 1998).
Definition:
" Able to read consistently with increased understanding simple, connected texts dealing with a variety of basic and social needs. Such texts are still linguistically noncomplex and have a clear underlying internal structure. [...] Examples may include short, straightforward descriptions of persons, places, and things written for a wide audience. [...] Can get some main ideas and information from texts at the next higher level featuring description and narration. Structural complexity may interfere with comprehension; for example, basic grammatical relations may be misinterpreted and temporal references may rely primarily on lexical items. " (ACTFL 1983 guidelines for reading proficiency).
Comments:
Can be further refined into Intermediate-Low, Intermediate-Mid and Intermediate-High.
The Intermediate-Low and Intermediate-Mid levels correspond to level 1 of the ILR scale. The Intermediate-High level corresponds to level 1+ of the ILR scale. (Orwig 1998).
Definition:
" Advanced: Able to read somewhat longer prose of several paragraphs in length, particularly if presented with a clear underlying structure. The prose is predominantly in familiar sentence patterns. Reader gets the main ideas and facts and misses some details. [...] Texts at this level include descriptions and narrations such as simple short stories, news items, bibliographical information, social notices, personal correspondence, routinized business letters, and simple technical material written for the general reader. " (ACTFL 1983 guidelines for reading proficiency).
" Advanced Plus: Able to follow essential points of written discourse at the Superior level in areas of special interest or knowledge. Able to understand parts of texts which are conceptually abstract and linguistically complex, and/or texts which treat unfamiliar topics and situations, as well as some texts which involve aspects of target-language culture. Able to comprehend the facts to make appropriate inferences. [...] Misunderstandings may occur. " (ACTFL 1983 guidelines for reading proficiency).
Comments:
Can be further refined into Advanced and Advanced Plus (in the 1983 version).
The Advanced level corresponds to level 2 of the ILR scale. The Advanced Plus level corresponds to level 2+ of the ILR scale. (Orwig 1998).
Definition:
" Able to read with almost complete comprehension and at normal speed expository prose on unfamiliar subjects and a variety of literary texts. Reading ability is not dependent on subject matter knowledge, although the reader is not expected to comprehend thoroughly texts which are highly dependent on knowledge of the target culture. [...] Occasional misunderstandings may still occur; for example, the reader may experience some difficulty with unusually complex structures and low-frequency idioms. [...] Material at this level will include a variety of literary texts, editorials, correspondence, general reports, and technical material in professional fields. Rereading is rarely necessary, and misreading is rare. " (ACTFL 1983 guidelines for reading proficiency).
Comments:
The Superior level corresponds to levels 3 and 3+ of the ILR scale (Orwig 1998).
Definition:
" Able to read fluently and accurately most styles and forms of the language pertinent to academic and professional needs. Able to relate inferences in the text to real-world knowledge and understand almost all sociolinguistic and cultural references by processing language from within the cultural framework. Able to understand a writer's use of nuance and subtlety. Can readily follow unpredictable turns of thought and author intent in such materials as sophisticated editorials, specialized journal articles, and literary texts such as novels, plays, poems, as well as in any subject matter area directed to the general reader. " (ACTFL 1983 guidelines for reading proficiency).
Comments:
The Distinguished level is not refined in the ACTFL guidelines.
The Distinguished level corresponds to levels 4 and 4+ of the ILR scale. The ILR scale also includes level 5 corresponding to the Native level, which is not specified by the ACTFL guidelines (Orwig 1998).
A person at ILR level 4 (or S-4) is " able to use the language fluently and accurately on all levels normally pertinent to professional needs" (Orwig 1998).
Definition:
This refers to proficiency in the target language as attested by some recognised measurement.
The level of proficiency may be measured, for example, by local education tests, internationally recognised examination schemes or organisation-internal testing.
Two of the best known language proficiency scales are the ACTFL guidelines (first proposed in 1983 by the American Council for the Teaching of Foreign Languages) and the ILR (FSI) proficiency scale, a five-level scale originally developed by the Foreign Service Institute (FSI) of the United States government, and later adopted by other services under the name of Interagency Language Roundtable (ILR) scale.
The scale proposed for use in FEMTI is based on the ACTFL guidelines.
Depending on the operations performed on the translation, it is either the reading or the writing proficiency that is more specifically relevant. We propose to use the ACTFL reading proficiency scale (1985) -- note that only the guidelines for writing/speaking have been recently updated.
References for this characteristic: ACTFL 1983, ACTFL 1999, ACTFL 2001, Orwig 1998, Clark and Clifford 1988.
Comments:
In the case of generic translation consumers, it is often not feasible to have more than an informal notion of the degree of their proficiency in the target language.
In a specific evaluation, the level of proficiency in the target language of a potential or intended translation consumer will strongly influence what are considered as acceptable measures of functionality.
Depending on the type of output of the MT system (spoken or written), the guidelines for reading vs. listening should be used.
Definition:
" The reader can identify an increasing number of highly contextualized words and/or phrases including cognates and borrowed words, where appropriate. [...] [May have] sufficient control of the writing system to interpret written language in areas of practical need. Where vocabulary has been learned, can read for instructional and directional purposes, standardized messages, phrases, or expressions, such as some items on menus, schedules, timetables, maps, and signs. " (ACTFL 1983 guidelines for reading proficiency).
Comments:
Can be further refined into Novice-Low, Novice-Mid and Novice-High.
The Novice-Low and Novice-Mid levels correspond to the 0 level of the ILR scale. The Novice-High level corresponds to the 0+ level of the ILR scale. (Orwig 1998).
Definition:
" Able to read consistently with increased understanding simple, connected texts dealing with a variety of basic and social needs. Such texts are still linguistically noncomplex and have a clear underlying internal structure. [...] Examples may include short, straightforward descriptions of persons, places, and things written for a wide audience. [...] Can get some main ideas and information from texts at the next higher level featuring description and narration. Structural complexity may interfere with comprehension; for example, basic grammatical relations may be misinterpreted and temporal references may rely primarily on lexical items. " (ACTFL 1983 guidelines for reading proficiency).
Comments:
Can be further refined into Intermediate-Low, Intermediate-Mid and Intermediate-High.
The Intermediate-Low and Intermediate-Mid levels correspond to level 1 of the ILR scale. The Intermediate-High level corresponds to level 1+ of the ILR scale. (Orwig 1998).
Definition:
" Advanced: Able to read somewhat longer prose of several paragraphs in length, particularly if presented with a clear underlying structure. The prose is predominantly in familiar sentence patterns. Reader gets the main ideas and facts and misses some details. [...] Texts at this level include descriptions and narrations such as simple short stories, news items, bibliographical information, social notices, personal correspondence, routinized business letters, and simple technical material written for the general reader. " (ACTFL 1983 guidelines for reading proficiency).
" Advanced Plus: Able to follow essential points of written discourse at the Superior level in areas of special interest or knowledge. Able to understand parts of texts which are conceptually abstract and linguistically complex, and/or texts which treat unfamiliar topics and situations, as well as some texts which involve aspects of target-language culture. Able to comprehend the facts to make appropriate inferences. [...] Misunderstandings may occur. " (ACTFL 1983 guidelines for reading proficiency).
Comments:
Can be further refined into Advanced and Advanced Plus (in the 1983 version).
The Advanced level corresponds to level 2 of the ILR scale. The Advanced Plus level corresponds to level 2+ of the ILR scale. (Orwig 1998).
Definition:
" Able to read with almost complete comprehension and at normal speed expository prose on unfamiliar subjects and a variety of literary texts. Reading ability is not dependent on subject matter knowledge, although the reader is not expected to comprehend thoroughly texts which are highly dependent on knowledge of the target culture. [...] Occasional misunderstandings may still occur; for example, the reader may experience some difficulty with unusually complex structures and low-frequency idioms. [...] Material at this level will include a variety of literary texts, editorials, correspondence, general reports, and technical material in professional fields. Rereading is rarely necessary, and misreading is rare. " (ACTFL 1983 guidelines for reading proficiency).
Comments:
The Superior level corresponds to levels 3 and 3+ of the ILR scale (Orwig 1998).
Definition:
" Able to read fluently and accurately most styles and forms of the language pertinent to academic and professional needs. Able to relate inferences in the text to real-world knowledge and understand almost all sociolinguistic and cultural references by processing language from within the cultural framework. Able to understand a writer's use of nuance and subtlety. Can readily follow unpredictable turns of thought and author intent in such materials as sophisticated editorials, specialized journal articles, and literary texts such as novels, plays, poems, as well as in any subject matter area directed to the general reader. " (ACTFL 1983 guidelines for reading proficiency).
Comments:
The Superior level is not refined in the ACTFL guidelines.
The Distinguished level corresponds to levels 4 and 4+ of the ILR scale. The ILR scale also includes level 5 corresponding to the Native level, which is not specified by the ACTFL guidelines (Orwig 1998).
A person at ILR level 4 (or S-4) is " able to use the language fluently and accurately on all levels normally pertinent to professional needs" (Orwig 1998).
Definition: This refers to the degree to which the user is at ease in computer use and manipulation.
Comments:
This characteristic does not include the user's level of familiarity with the particular system being evaluated.
The user's level of proficiency will in part relate to evaluation of usability.
Standardised tests of computer literacy are being developed, for example the European "computer driving licence".
Definition: An organisational user of MT may be a corporate user, a translation service, a translation agency or other provider of translation.
Comments: This characteristic is intended to capture some of the needs common to organisational users of MT. It does not cover needs influenced by the character of the translation task (1.3), or needs influenced by the characteristics of the MT system users (1.4.1) or by the characteristics of translation consumers within the organisation (1.4.2).
Definition: This concerns the volume of translation typically dealt with by the organisation.
Comments:
The volume of translation work may be measured in many ways, including pages per day, week, month or year. It can also be described by how much time is used in the translation work. Usually this is measured in person-hours. Naturally the amount of work correlates with the quality of the translation, which may vary from text to text.
In addition, the amount of work typically varies with the target and source languages (EAGLES-96).
A factor which has recently become relevant in estimating the quantity of translation in some contexts is the existence or non-existence of pertinent translation memories.
Definition: This concerns the number of personnel within the organisation who will be directly concerned with the use of the MT system.
Comments: The number of personnel needed for pre- and post-editing, dictionary and grammar enhancement etc., is a factor that must be taken into account when evaluating an MT system. Similarly, the availability of suitably qualified personnel within the organisation can be of importance. If new staff need to be hired, training costs should be taken into account.
Definition: This concerns the deadlines for translation production typical within the organisation.
Comments: Different kinds of translation work may typically have different deadlines. Of particular interest when MT is being considered is the case of large-volume translation which must be produced very quickly.
Quality is a complex notion that depends on the point of view of the different actors related to an MT system. It is most often related to the judgment of final users (Dostert); or defined as the composite measurement of fidelity, intelligibility and elegance (Johnson); or it is a result of the analysis of situational dimensions (House) - all in the Van Slype report.
Comments:
ISO 9126 distinguishes between internal characteristics which pertain to the internal static properties of the software and external characteristics which are the characteristics which can be observed when the system is in operation. There is some connection here with the notions of glass box and black box evaluation.
Characteristics of measurements in general
A measurement is the use of a metric to assign a measure (a value, which may be a number or category) from a scale to a quality/attribute of an entity (ISO/IEC 9126-1:2001(E)). Whatever the measure applied, all measures share certain characteristics. An adequate description of a measure should include the following properties:
1. Definitional characteristics
1.a Textual definition and description of the metric
1.b Input to measurement process (e.g., sentence, text, source+target text, system, etc.)
1.c Measure, i.e., output of measurement process (e.g., number on a scale, symbolic value, Y/N decision, etc.)
2. Dependencies
2.a Domain/genre dependence (lists of subject domains and/or genres to which the measure applies or doesn't)
2.b Task dependence (suitability; tasks for which the measure is or is not appropriate)
2.c Language dependence (source or target languages for which the measure holds or does not)
3. Metric Sensitivity (for numeric measurements only)
3.a Accuracy (accuracy of measurement: confidence interval, error bars, correlation with human judgments, etc.)
3.b Variance (variance of measurement across tests; inter-evaluator agreement)
4. Coverage
4.a Completeness of the metric (scope; proportion/degree/percentage of the quality that the measurement is designed to measure)
4.b Completeness of the measurement (scope; proportion/degree/percentage of the quality that the measurement has measured)
5. Costs
5.a Cost to prepare test materials (taking into account its reusability later)
5.b Cost to perform measurement (if necessary, expressed per repeatable unit)
6. Resources/knowledge required
6.a What people/equipment/data/information is required to perform the measurement?
6.b What knowledge/skill is required of the evaluators?
6.c How much time is required to perform the measurement?
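The property groups above can be read as the fields of a record that accompanies each measure. The following is a minimal sketch of such a record, assuming Python; the class name `MeasureDescription`, its field names and the `unspecified` helper are hypothetical and are not part of ISO/IEC 9126 or of this framework.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class MeasureDescription:
    """Hypothetical record mirroring property groups 1 to 6 above."""
    # 1. Definitional characteristics
    definition: str                    # 1.a textual definition of the metric
    input_unit: str                    # 1.b e.g. "sentence", "source+target text"
    output_type: str                   # 1.c e.g. "number on a scale", "Y/N decision"
    # 2. Dependencies
    domains: List[str] = field(default_factory=list)    # 2.a
    tasks: List[str] = field(default_factory=list)      # 2.b
    languages: List[str] = field(default_factory=list)  # 2.c
    # 3. Metric sensitivity (numeric measurements only)
    accuracy: Optional[str] = None                      # 3.a
    variance: Optional[str] = None                      # 3.b
    # 4. Coverage
    metric_completeness: Optional[str] = None           # 4.a
    measurement_completeness: Optional[str] = None      # 4.b
    # 5. Costs
    preparation_cost: Optional[str] = None              # 5.a
    measurement_cost: Optional[str] = None              # 5.b
    # 6. Resources/knowledge required
    required_resources: Optional[str] = None            # 6.a
    evaluator_skills: Optional[str] = None              # 6.b
    time_required: Optional[str] = None                 # 6.c

    def unspecified(self) -> List[str]:
        """Names of the properties that have not been filled in yet."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) in (None, [])]
```

Calling `unspecified()` on a partly filled description gives a quick check of which properties still need to be documented before the measure can be considered adequately described.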
METRICS
The quality of translation can be evaluated in two modes:
1. Quality of the translation without adjustment
This aims to evaluate the quality of translation before the dictionary and/or grammar is adjusted. This is also an absolute evaluation of the system (JEIDA report).
2. Quality of the translation with adjustment
This aims to evaluate the quality of translation after the dictionary and/or grammar is adjusted. The higher the quality of translation the user needs, the more severe the evaluation. In this respect this item shows the degree of the user's satisfaction with the system (JEIDA report).
All of the following definitions are taken from Van Slype's Critical Report, 1979. It seems very important to keep them in mind before proceeding to further discussion of the quality features.
DEFINITION OF TRANSLATION
J. HOUSE -- Translation is the replacement of a text written in a source language by a semantically and pragmatically equivalent text written in the target language. (The translation of oral texts is a different activity, namely interpretation).
TRANSLATION QUALITY
L'ASSOCIATION JEAN FAVARD distinguishes: (a) the intrinsic qualities, which are independent of the reader; (b) the extrinsic qualities, which are related to the "text-reader" couple. A text, even badly translated (and thus of low intrinsic quality) can nevertheless, for an informed reader, be as clear as if it had been well translated. However, beyond a certain deterioration in intrinsic quality, the extrinsic quality becomes very poor.
H. BRUDERER -- Quality is a relative concept, i.e. one related to a specific object. Quality can apparently be measured, at least in part, but it remains much more difficult to quantify abstract (conceptual, subjective) phenomena than concrete (perceptible, real, tangible) things. Quality can be evaluated: (a) either positively: assessment of merits, advantages; (b) or negatively: assessment of deficiencies, errors, disadvantages; (c) or totally: assessment of the positive and the negative aspects. The evaluation of the translation quality -- whether human or computerized -- has to take into account the following intralinguistic and interlinguistic factors: morphology, syntax, content, terminology, style, conformity. A faithful translation reproduces the sense of the original text, but it does not necessarily, if it is to be considered an intelligent translation, have to be identical to the original text. Given that they partially overlap, content and fidelity should be evaluated on an overall basis. Similarly, it is difficult to differentiate clearly syntax and semantics. Style, on the other hand, influences all levels (morphology, syntax, semantics, terminology).
IR.L. JOHNSON defines translation quality by three factors: fidelity, intelligibility and elegance. The importance of these three factors may vary with the type of text considered. Features can be observed: (a) superficially, via linguistic elements such as lexical and syntactic exactitude; (b) indirectly, via the reactions of the users to the translated text.
B. KUHLEN stresses that there is not a universal criterion for MT evaluation: (a) on the one hand because it does not seem that MT can ever reach the level of quality of human translation; (b) on the other hand, because the evaluation criteria have to be chosen according to the aim in view; (c) finally, because the individual parameters, which taken together permit an assessment of the quality of MT, often contradict each other, with the result that an overall rating would not be significant to the specific performance of the components.
Z.L. PANKOWICZ feels that the usefulness of MT and HT has to be based on quality, speed and cost. Determination of the optimal balance between these three parameters depends on the environment of each translation activity. It is necessary to understand, in his view, that the quality of HT and MT is indefinable, at least in any absolute way. The assessment of the quality of HT is traditionally based on its completeness and on stylistic elements.
A.J. PETIT takes the view that the translation should not comprise misconstruction, but admits however a tolerance of up to 1 % of the sentences in the case of translations to be supplied raw to the final user and 2 % of the sentences in the case of texts to be revised before submission to the users. This tolerance is intended to allow for normal risks of error or accident.
Y. WILKS thinks that the purist who feels that the least translation defect nullifies the translation is often mistaken in two of his postulates: (a) he exaggerates the attention and comprehension which the average reader achieves with a technical document (consequently, errors of translation do not negate the value of the text); (b) he exaggerates the quality of the mass of human translations produced on an enormous scale and at high speed.
RELATIONSHIP BETWEEN TRANSLATION QUALITIES AND EVALUATION CRITERIA
According to G. BOURQUIN the criteria for evaluating a translation will vary according to whether it is produced by a human translator or by the machine: (a) from the human, "finesse" will be required: open to the ethnoculture and to work on linguistics, the human translates with his sensitivity, his intuition, his common sense; (b) the computer will be expected to offer regularity, precision, infallibility, speed, and encyclopedic exhaustiveness.
M. MASTERMAN notes that our ignorance of the very nature of translation leads to a discordance between the evaluation criteria used or proposed by various authors.
A.J. PETIT -- A product is acceptable only if it meets the requirements of its users. As regards texts (original texts or human or machine translations), the principal requirements are:
(1) For utility technical texts (maintenance or user manuals): (a) no errors, (b) homogeneity, (c) clarity, without ambiguity or gibberish which might obscure the sense of the message, (d) simple correct style, without extravagances or recherché elements, (e) use of the terms recognized in the relevant sector.
(2) For educational technical texts: (a) no technical errors, (b) adaptation of the terms recognized in the relevant sector.
(3) For documentary scientific texts: (a) clear exposition of theory, (b) a flowing style without errors and without excessively long sentences incorporating several different ideas, (c) use of the basic terminology of the discipline.
These requirements have however to be viewed from a different angle according to whether the translation is intended: (a) to be revised: in this case, the translation system (human or machine) has to be aware of its own shortcomings, and indicate by itself all the ambiguities which it was not able to resolve; it delivers an incomplete product, but one without serious defects; (b) to be supplied direct to the final user: the translation must then be complete (experienced human translator or a computerised system producing a complete translation, without any misconstruction) and without serious defects (human error or accident both being normal risks).
THE AUTHORS OF THE REPORT PRESENTED BY PHILIPS distinguish between evaluation of translations with and without comparison with the source text. In the first case, it is necessary to assess in what measure the translation (a) reproduces what is stated in the original (for example contractual texts), (b) reproduces what the author of the original intends to say, with the certainty that the message is properly understood (for example translation of manuals). To assess the quality of a translation, it is necessary to answer the following questions. (1) On the aim of the translation: (1.1) does the translation reproduce the content of the original? (1.2) does the translation reproduce the formulations of the original? (1.3) does the translation reproduce the intention of the author? (2) On the type of text: (2.1) is all the information presented? (2.2) can the translation achieve the desired effect? (2.3) have the necessary corrections been made in such a way that communication has the best chance of success?
In the second case, evaluation of the translation without reference to the original, the assessment of the quality of the translation has to cover: (a) the grammatical correctness, (b) style of idioms, (c) the use of current words, expressions and structures in the target language, (d) the absence of contradictions or ambiguities.
ASSESSMENT
The concept of the quality of a manufactured product is, in general, unambiguous: the product has to correspond to the specifications, and a battery of quality control tests can easily be arranged and made the responsibility of controllers who are often relatively unqualified. The concept of translation quality is much more indeterminate, and the authors' contributions can be summarized fairly briefly. (1) The quality has to be assessed, not in the absolute, but according to the aims of the writer of the texts to be translated and by those who decide how it is to be distributed. (2) The quality achieved by HT cannot be expected of MT, and the latter has thus to be used for more limited aims than the former (which does not mean that, within the scope of these limited aims, there does not exist a major potential demand). (3) The evaluation criteria have to be chosen according to these specific aims. (4) Since translation quality cannot be measured in the absolute, on the basis of a single criterion, its assessment should combine several criteria.
Definition: The capability of the software product to provide functions which meet stated and implied needs when the software is used under specified conditions.
References for this characteristic: ISO 9126: 2001, 6.1.
Comments: This characteristic is concerned with what the software does to fulfil needs whereas the other characteristics are mainly concerned with where and how it fulfils needs.
Definition:
The capability of the software product to provide the right or agreed results or effects with the needed degree of precision (ISO 9126: 2001, 6.1.2).
Accuracy and its sub-qualities are established by reference to the source language text.
Metrics:|
Definition: Correction rate defined as the ratio of the number of words corrected to the number of words in the translation (Van Slype). Method: Count the number of words corrected and the number of words in the initial translation. Measurement: Ratio of the number of words corrected to the number of words in the translation. |
|
Definition: Correction rate defined as the number of insertions, deletions and substitutions (the "edit distance") required to correct a text after translation (Ney and Niessen). Method: Count the number of insertions, deletions and substitutions needed to correct a text. Note that this metric can be automated. Measurement: Edit distance, which is often a linear combination of the three counts. |
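Both correction-rate metrics above can be approximated automatically once a corrected (post-edited) version of the raw translation is available. The following is a minimal sketch, assuming whitespace tokenisation and a standard word-level Levenshtein alignment; the function names and the choice to treat every edit operation as one "word corrected" are illustrative assumptions, not the procedure used by Van Slype or by Ney and Niessen.

```python
from typing import List, Tuple


def edit_counts(raw: List[str], corrected: List[str]) -> Tuple[int, int, int]:
    """Return (insertions, deletions, substitutions) of one minimal
    word-level edit script that turns `raw` into `corrected`."""
    m, n = len(raw), len(corrected)
    # cost[i][j] = (total, ins, dels, subs) for raw[:i] -> corrected[:j]
    cost = [[(0, 0, 0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = (i, 0, i, 0)          # delete every raw word
    for j in range(1, n + 1):
        cost[0][j] = (j, j, 0, 0)          # insert every corrected word
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if raw[i - 1] == corrected[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]   # words match, no edit
                continue
            t, a, d, s = cost[i - 1][j - 1]
            substitution = (t + 1, a, d, s + 1)
            t, a, d, s = cost[i - 1][j]
            deletion = (t + 1, a, d + 1, s)
            t, a, d, s = cost[i][j - 1]
            insertion = (t + 1, a + 1, d, s)
            # Tuples compare on the total first, so this keeps a minimal script.
            cost[i][j] = min(substitution, deletion, insertion)
    _, ins, dels, subs = cost[m][n]
    return ins, dels, subs


def weighted_edit_distance(ins: int, dels: int, subs: int,
                           w_ins: float = 1.0, w_del: float = 1.0,
                           w_sub: float = 1.0) -> float:
    """Linear combination of the three counts (unit weights by default)."""
    return w_ins * ins + w_del * dels + w_sub * subs


def correction_rate(raw: List[str], corrected: List[str]) -> float:
    """Edits per word of the raw translation (assumption: one edit
    operation stands for one corrected word)."""
    if not raw:
        return 0.0
    return sum(edit_counts(raw, corrected)) / len(raw)
```

With unit weights the linear combination is the ordinary Levenshtein distance; other weights can be chosen if, for example, substitutions are judged cheaper to post-edit than insertions.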
Definition: Correct translation of technical (domain-specific) terms.
Metrics:|
|
References for this characteristic:
Filatova, 2000.
Comments:
Names should be transliterated or translated (e.g. 'London' / FR: 'Londres') as appropriate.
This characteristic becomes very important in Assimilation tasks (1.3.1./113), including Information Extraction.
Definition:
Subjective evaluation of the degree to which the information contained in the original text has been reproduced without distortion in the translation (Van Slype).
Measurement of the correctness of the information transferred from the source language to the target language (Halliday in Van Slype's Critical Report).
Metrics:|
Method: Rating of sentences read out of context on a 9-point scale. Notes:(in Van Slype's Critical Report) |
|
Method: Rating on a 25-point scale. Notes:(in Van Slype's Critical Report) |
|
Method: Assessment of the correctness of the information transferred. Notes:(in Van Slype's Critical Report) |
|
Method: Rating of text units read on a 9-point scale. Notes:in Van Slype's Critical Report |
|
Method: Rating of a text on a 100-point scale. Notes:in Van Slype's Critical Report |
|
Method: Shannon measurement of the quality of information transferred. Notes:in Van Slype's Critical Report |
|
Notes:in Van Slype's Critical Report |
|
Method: Rating of sentences read on a 4-point scale. Notes:in Van Slype's Critical Report |
|
Method: Rating of 'Adequacy' on a 5-point scale. Notes:in DARPA 94 |
|
Method: BLEU evaluation toolkit: automatic n-gram comparison of translated sentences with one or more human reference translations. Notes: in Papineni et al. 2001 |
|
Method: Rank-order evaluation of MT system: correlation of automatically computed semantic and syntactic attributes of the MT output with human scores for adequacy and informativeness, and also fluency. |
|
Method: Automated word-error-rate evaluation. Notes:in Och, Tillmann and Ney, 1999 |
|
Method: Automated metric using head transducers. Notes:Alshawi et al, 2000 |
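The automatic n-gram comparison listed among the metrics above rests on counting how many n-grams of the system output also occur in the reference translations. The following is a minimal sketch of that clipped ("modified") n-gram precision, assuming pre-tokenised input; it covers only this one component, not the full BLEU score, which additionally combines several n-gram orders and applies a brevity penalty. The function names are illustrative.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str],
                       references: List[List[str]], n: int) -> float:
    """Share of candidate n-grams found in the references, where each
    n-gram is credited at most as often as it occurs in any single
    reference (clipping)."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())
```

Clipping prevents a degenerate output that repeats one frequent word from scoring well: a candidate consisting of "the" seven times is credited only for as many occurrences as the best-matching reference contains.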
Comments: The fidelity rating has been found to be equal to or lower than the comprehensibility rating, since the unintelligible part of the message is not found in the translation. Any variation between the comprehensibility rating and the fidelity rating is due to additional distortion of the information, which can arise from:
loss of information (silence) - example: word not translated
interference (noise) - example: word added by the system
distortion from a combination of loss and interference - example: word badly translated
Detailed analysis of the fidelity of a translation is very difficult to carry out, since each sentence conveys not a single item of information or a series of elementary items of information, but rather a portion of message or a series of complex messages whose relative importance in the sentence is not easy to appreciate.
Some automated metrics assume a fidelity evaluation as a human ground truth, or are relevant to fidelity evaluation.
Definition:
Capability of the system to produce from a given input, and at a given point in time, the same output.
Metrics:|
Method: Count the number of alternative translations for a given input unit. |
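The count described above can be sketched as follows, assuming the evaluator has collected (source unit, translation) pairs from the system output; the function name and the exact-string comparison of translations are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def alternative_translations(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Map each source unit to the number of distinct translations
    observed for it; 1 means the unit was rendered consistently."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for source, target in pairs:
        seen[source].add(target)
    return {source: len(targets) for source, targets in seen.items()}
```

Source units with a count above 1 are the candidates to inspect; whether the variation is a defect depends on the unit, since a term in technical documentation should normally have a single rendering whereas a full sentence may legitimately vary with context.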
Comments:
Consistency is particularly important for developers and for the translation of technical documentation.
Definition: The capability of the software product to provide an appropriate set of functions for specified tasks and user objectives (ISO 9126: 2001, 6.1.1).
Definition: Qualities of the translation that can be evaluated solely on the basis of the output of the system in the target language.
Definition:
This has also been called fluency, intelligibility, and clarity.
The extent to which a sentence in the translated text reads naturally.
Ease with which a translation can be understood, i.e. its clarity to the reader. (Halliday in Van Slype's Critical Report) .
Metrics:|
Definition: Cloze tests are designed to test comprehension of a text by removing words at regular intervals in the text and asking readers to identify the missing words. Method: Take a number of translations produced by the system and remove or blank out certain words. Give these amended texts to a group of readers and ask them to supply the missing words. Measurement: The number (or percentage) of words correctly guessed by the readers. Additional info: There are a number of examples of readability evaluations in the literature (see references). Some evaluators simply remove every nth word (where "n" can be any number) whilst others specifically choose to remove only content words. This metric is relatively costly to apply since it involves preparation of the test materials and a number of different test subjects: readers who have not seen the complete texts. The larger the number of test subjects and the larger the text, the more informative the metric will be. |
|
Definition: This metric requires test subjects to give their subjective assessment of how easy it is to understand a text translated by the system. Method: A number of texts translated by the system are selected and presented to test subjects who are asked to rate the intelligibility of each text (or individual sentences) according to a predefined scale. Measurement: For each text or sentence, a point on a predefined scale. Scale: predefined numerical scale (see notes). Additional info: As with all metrics based on human judgements, applying this metric can be costly depending on the number of subjects used. The greater the number of subjects, the more indicative the metric is of intelligibility. Notes: There are a number of examples of applying this metric in the literature (see references) and the number of points on the rating scales varies considerably (3-, 4-, 7- or 9-point scales). However, it is generally recommended that an odd number of points on the scale should be used. |
|
Definition: This is defined as the time required to read and understand a text, or to realize its unintelligibility, but not to memorize it. Method: Test subjects are presented with texts translated by the system and are asked to read each text until they understand it. The evaluator measures how long the subject takes to read the text. Measurement: The amount of time to read a text of a predefined length. Scale: minutes per text. Additional info: Alternative metrics in a similar vein exist, for example timing how long it takes a reader to read each sentence, or how long it takes a reader to read a text and answer multiple-choice questions on the text. |
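The preparation and scoring of the cloze test described above can be sketched as follows. This is a minimal illustration assuming whitespace tokenisation, removal of every nth word, and case-insensitive exact matching of the readers' guesses; the function names and the blank marker are hypothetical, and evaluators who remove only content words would need a different selection step.

```python
from typing import List, Tuple


def make_cloze(text: str, n: int, blank: str = "____") -> Tuple[str, List[str]]:
    """Blank out every nth word; return the amended text and the
    removed words in order of occurrence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    words = text.split()
    removed: List[str] = []
    for i in range(n - 1, len(words), n):
        removed.append(words[i])
        words[i] = blank
    return " ".join(words), removed


def cloze_score(guesses: List[str], removed: List[str]) -> float:
    """Fraction of removed words that one reader guessed correctly."""
    if not removed:
        return 0.0
    correct = sum(g.strip().lower() == r.strip().lower()
                  for g, r in zip(guesses, removed))
    return correct / len(removed)
```

The per-reader scores would then be averaged over all test subjects. Because punctuation stays attached to the neighbouring word under whitespace tokenisation, a real test would usually normalise punctuation before comparing guesses.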
References for this characteristic:
Crook and Bishop (in Van Slype's Critical Report): Cloze test (every eighth word) and subjective intelligibility on 7-point scale
Halliday (in Van Slype's Critical Report): Clozentropy.
Sinaiko (in Van Slype's Critical Report): Multiple-choice questionnaire + Cloze test (every fifth word) + clarity measurement + time measurement + rating of sentences read on a 3-point scale.
Carroll (ALPAC report): rating of sentences read on a 9-point scale.
Carroll and Bishop (in Van Slype's Critical Report): rating of sentences on a 7-point scale.
Leavitt (in Van Slype's Critical Report): rating of texts read on a 9-point scale.
Van Slype (in Van Slype's Critical Report): rating of sentences read in their context on a 4-point scale.
Vauquois (in Van Slype's Critical Report): rating of sentences read on a 2-point and 3-point scale.
Pfafflin (in Van Slype's Critical Report): Rating of sentences read on a 3-point scale.
Vanni & Miller (2001, 2002): "Do you get it?" - snap judgement rating of sentences on scale from 0 to 3.
Somers' use of cloze test (Somers and Wild, 2000).
B.H. Dostert (in Van Slype's Final Report): asking final users to state what percentage of additional time they require to read MT, as compared to an original in their own language.
J.B. Carroll (in Van Slype's Final Report): measuring the time spent by the evaluator in reading each sentence of the sample.
G. van Slype: measuring the time spent by the evaluator in reading each text of the sample.
Pfafflin and Orr (both quoted by T.C. Halliday): by measuring the response time to a multiple-choice questionnaire.
Definition:
The extent to which the text as a whole is easy to understand. That is, the extent to which valid information and inferences can be drawn from different parts of the same document.
Comprehensibility reflects the degree to which a complete translation can be understood (whereas intelligibility is based on the general clarity of the translation, whether this is considered in its entirety or by segments out of context). (Halliday in Van Slype's Critical Report).
Subjective evaluation of the degree of comprehensibility and clarity of the translation. (Van Slype in Van Slype's Critical Report).
Metrics:|
Method: Halliday Noise test Notes:in Van Slype's Critical Report |
|
Method: Multiple-choice questionnaire. Notes:in Van Slype's Critical Report |
|
Method: Multiple-choice questionnaire. Notes:in Van Slype's Critical Report |
|
Method: Knowledge test. Notes:in Van Slype's Critical Report |
Comments:
This has also been called comprehension or intelligibility.
Metrics bearing on the readability of single sentences, as opposed to the comprehensibility of the text as a whole, have been moved to the Readability feature (2.2.1.1.1.1/172).
Definition:
The coherence of a text is the degree to which the reader can describe the role of each individual sentence (or group of sentences) with respect to the text as a whole. Theories such as Rhetorical Structure Theory (Mann and Thompson, 1988) attempt to formalize coherence using a set of inter-segment relations (such as Cause, Solutionhood, Elaboration) that express the internal document structure.
Measurement of the total contextual coherence (T.C. Halliday in Van Slype's Critical Report).
Metrics:|
Method: Measure degree to which roles of each discourse unit can be identified with respect to a gold standard. |
|
Method: for example, measure this feature by counting the total number of sentences in the machine translated text to which RST labels can be assigned. Notes:See Mann & Thompson, 2001, 2002 |
References for this characteristic:
Carlson, Marcu & Okurowski, 2001.
Comments:
It has been asserted that the quality of a translation can be assessed by its level of coherence without comparing it to the original text. Once a sufficiently large sample is available, the probability that the translation should be at the same time coherent and totally wrong is very weak. (Wilks in personal communication, 1992, also cited in Van Slype's Critical Report).
According to the definition the assessment of coherence can be done by a monolingual evaluator, whereas any judgement on the correctness of the translation necessarily involves making use of a bilingual evaluator. (Wilks in Van Slype's Critical Report).
Definition:
Cohesion of a text refers to the elements -- for example lexical chains, anaphora, ellipsis -- that link individual units across sentences.
Metrics:|
Method: Does the system render cohesive units appropriately for the target language? |
References for this characteristic:
Special issue of MT journal on Anaphora, 2001.
Reiter, Mellish & Levine, 1995.
Comments:
Cohesion is particularly interesting for translation between languages that have different requirements for structuring and managing redundant information. For example, Asian languages make frequent use of ellipsis and zero-pronouns which often must be resolved on translation into languages where such use is not licensed.
Cohesion is also important when the translated text is intended for subsequent summarization (see Information extraction / Summarization, 1.3.1.2/115).
Definition:
Qualities of the translation that must be evaluated on the basis of both the source language and the output of the system in the target language.
Suitability of source-to-target mapping to a particular task.
Coverage of cross language phenomena concerns the ability of the system to deal satisfactorily with the commonly recognized differences between the source and the target languages, with or without taking into account the presence or absence of these phenomena in any particular corpus.
Metrics:|
Method: By use of a set of test patterns. These should be in the form of simple source language patterns that are theory neutral, that is, descriptive in pedagogical terms rather than in terms of a particular syntactic theory whose principles could obscure the issue. For a number of European languages, such test suites are available as a product of the TSNLP project, which focused mainly on syntactic phenomena, and the DiET project. The Japanese MT research community has also produced such test suites as part of the JEIDA project. Commercial MT companies should also all have similar test suites: Logos and Systran both have test suites of this type. IBM has relevant test suites that were presented to the research community at LREC2000 in Athens and ACL-2001 in Toulouse. In order to arrive at a measurement, test suites of this type can be used, with either a correct/incorrect verdict for each sentence in the test suite, a percentage correct for each sentence (as long as the notion of "percentage correct" is well-defined), or a (3 to 10 point) scale of correctness for each sentence. The aggregate measure could be the percentage of sentences correct, the percentage of linguistic phenomena covered, or an aggregate measure of linguistic phenomena covered, weighted for phenomena important to the language pair and task of interest. It is also possible to use word error rate as a measurement, along the lines of automatic scoring of insertions, deletions, and substitutions relative to a gold standard (Niessen, Och, Leusch, and Ney, 2000), or as described in (Vanni & Miller, 2002) and (Vanni & Miller, 2001). |
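The aggregate measures mentioned above can be sketched as follows, assuming the simplest case of a correct/incorrect verdict per test sentence and one phenomenon label per sentence. The function names, the result format and the default weight of 1.0 for phenomena without an explicit weight are illustrative assumptions; none of the cited test suites prescribes this format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One entry per test sentence: (phenomenon label, verdict), True = correct.
Results = List[Tuple[str, bool]]


def percent_sentences_correct(results: Results) -> float:
    """Percentage of test-suite sentences judged correct."""
    if not results:
        return 0.0
    return 100.0 * sum(ok for _, ok in results) / len(results)


def phenomenon_accuracy(results: Results) -> Dict[str, float]:
    """Fraction of correct sentences for each linguistic phenomenon."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for phenomenon, ok in results:
        totals[phenomenon] += 1
        correct[phenomenon] += int(ok)
    return {p: correct[p] / totals[p] for p in totals}


def weighted_coverage(results: Results, weights: Dict[str, float]) -> float:
    """Weighted mean of per-phenomenon accuracy; the weights express the
    importance of each phenomenon for the language pair and task."""
    accuracy = phenomenon_accuracy(results)
    total_weight = sum(weights.get(p, 1.0) for p in accuracy)
    if total_weight == 0:
        return 0.0
    return sum(weights.get(p, 1.0) * a for p, a in accuracy.items()) / total_weight
```

The weighted variant matters when a phenomenon that is rare in the test suite is frequent in the texts of interest: raising its weight lets a failure there lower the aggregate more than the raw sentence count would.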
References for this characteristic:
Comments:
Whereas TSNLP and some other test suites of this type focus mainly on syntactic phenomena, test suites for general cross-language coverage should ideally address other cross-language phenomena as well: idioms, lexical and conflational divergences, etc.
Each commercial MT company should have such a test suite, which they may use for regression testing or for testing of improvements to the system. Ideally, in order to test systems from developers A and B, a test set covering the union of the phenomena covered by the two test suites should be used.
Definition:
This is a subjective evaluation of the correctness of the style of each sentence (Evaluation of the 1978 Version of the SYSTRAN English-French Automatic system of the Commission of the European Communities. Georges van Slype). This quality is also commonly referred to as "register" and includes degree of formality, forcefulness and bias as exhibited through both lexical and morpho-syntactic choices.
Metrics:|
Method: Evaluation of sentences on a 4-point scale. Notes: in Van Slype's Critical Report |
|
Notes: (Niessen, Och, Leusch, & Ney 2000) |
References for this characteristic:
Comments: This quality is distinct from Readability (2.2.1.1.1.1./172). A text may be highly readable but in an inappropriate style / register.
Definition:
Coverage refers to the ability of the system to deal satisfactorily with linguistic phenomena, both generally addressing known cross-language phenomena and specifically addressing phenomena in a corpus of interest.
Coverage of corpus-based problematic phenomena concerns the ability of the system to deal with the particular challenges presented by a corpus of interest.
Metrics:|
Method: By constituting a representative corpus and submitting it to the system in order to observe what errors occur. |
|
Method: Given a test suite of representative phenomena specific to the corpus of interest, low-level and aggregate measurements like those described in Cross-language phenomena (2.2.1.1.2/502) can be used. |
|
Method: Subjective human scoring on a 10-point scale. |
References for this characteristic:
Definition: Every MT system embodies some theory of language and of translation. Usually most of the theoretical assumptions are implicit, possibly not even known to the developer. These vary with respect to two general characteristics: the way in which knowledge of the translation process is represented and acquired (methodology), and how and at what point the different types of knowledge are applied during the translation process (models).
Comments:
Note also that there are multiple methods for measuring some of the qualities in this system. Some are more invasive, such as code review, grammar inspection, etc. Others are less so, as with the use of test suites. The level of testing granularity is applicable here, particularly when determining if the testing is glass-box (looking into the system structure / code) or black-box (seeing external behavior only).
Definition: The underlying theoretical methodology behind the development of a given system.
Metrics:|
Definition: The developer should provide a description of the theory and method of translation used by the system. Method: Provision of supporting documentation such as white papers. Scale: Percentage of conformance. |
References for this characteristic:
Comments:
There is a variety of each type of system but especially so with respect to rule-based systems.
Generally, it is assumed that a theoretically sound system is easier to use, manage, update, etc. than one that is not theoretically based.
As important as current coverage is a system's capacity for update and improvement.
Many of the techniques used in verification of particular models tend to be glass-box, that is, isolating an element of the system, or potentially examining source code and data files.
Definition:
A rule-based model, also known as "knowledge based", involves rules to analyse and represent the source text in a more abstract form as well as rules to map this abstract representation to the corresponding target text; these rules can be morphological, lexical, etc.
Metrics:|
Definition: If the system uses a grammar, test its relaxation capacity. Method: Two methods are available: running the system against a test suite, and examining the rules. Measurement: Number of grammatical relaxations permitted. |
|
Definition: If the system uses a grammar, determine its coverage. Method: Two methods are available: running the system against a test suite, and examining the rules. Measurement: Number of grammatical functions covered by the system. |
|
Definition: If the system uses a grammar, the form and number of the rules. Method: Specification of the form of the grammar rules and counting of the grammar rules. Measurement: Conformance to standard grammar specification; number of rules. |
|
Definition: The developer should provide a description of the theory and method of translation used by the system. Method: The developer should provide white-papers and supporting documentation. Measurement: Confirmation of method by study of documentation. |
|
Method: Design a grammatical rule and add it to the system, or change an existing rule. Measurement: Yes or no: Can rules be added or changed? Notes: See ease of update in section 2.2.5.2.4 |
Comments: This characteristic is also related to the MT model used by the system (see "MT models" characteristic below).
Definition: These models rely on monolingual or bilingual texts (i.e. corpora), generally aligned, to which statistical methods are applied to obtain information about the source and target languages involved in the translation. The information obtained cannot be classified as knowledge (as in a rule-based model) since it only indicates, in terms of probabilities extracted from the corpora, how often certain words appear together or can be inverted, etc. In other words, these models produce statistics about the strings in the corpora and use them in the translation process.
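As a rough illustration of what "statistics about the strings in the corpora" can mean, the following Python sketch counts how often source and target words co-occur in aligned sentence pairs and turns the counts into relative frequencies. It is a deliberately simplified stand-in, not the estimation procedure of any particular statistical MT system; the function name and the toy corpus are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def cooccurrence_probabilities(
        aligned_pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """For each source word, the relative frequency with which each target
    word appears in the aligned target sentence."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for source_sentence, target_sentence in aligned_pairs:
        for s in source_sentence.split():
            for t in target_sentence.split():
                counts[s][t] += 1
    probabilities = {}
    for s, target_counts in counts.items():
        total = sum(target_counts.values())
        probabilities[s] = {t: c / total for t, c in target_counts.items()}
    return probabilities


if __name__ == "__main__":
    # Toy aligned corpus, assumed for illustration only.
    corpus = [("the house", "la maison"),
              ("the book", "le livre"),
              ("a house", "une maison")]
    table = cooccurrence_probabilities(corpus)
    print(table["house"])
```

Even on this toy corpus, "maison" receives the largest share for "house" simply because it co-occurs most often, with no rule stating the correspondence.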
Metrics:|
Definition: The developer should provide a description of the theory and method of translation used by the system. Method: The developer should provide white-papers and supporting documentation. Measurement: Confirmation of method by study of documentation. |
|
Definition: Minimum size of the training corpus. Method: Specification by developer of minimum training corpus size. Measurement: Yes or no: Size specification reported. |
|
Definition: Accessibility of training corpus or techniques. Method: Specification by the developer of interface / tools for training corpus. Measurement: Yes or no: Training corpus is accessible. |
|
Definition: Specification for training corpus preparation. Method: Provision by developer of training corpus preparation tools / documentation. Measurement: Yes or no: Training corpus preparation tools / documentation are available. |
Comments:
It is sometimes assumed that statistical MT systems constitute a new type of translation model. In fact, they implement one of the above-mentioned models in a different way, by building the lexicons, transfer rules, etc., unfortunately using large collections of data to learn from statistically. There is no new 'statistical MT program'. IBM's CANDIDE system (Della Pietra, et al.) and the EGYPT system (Knight, et al.) are examples of direct replacement systems involving some word order reorganization.
Definition: These models base the translation on a large database of examples of texts in both the source and target language. The difference with knowledge-based models is that they do not use rules at all and the main difference with translation memory models is that in TMs it is still the user producing the final translation (whereas this is not the case in example-based models).
Metrics:|
Definition: The developer should provide a description of the theory and method of translation used by the system. Method: The developer should provide white-papers and supporting documentation. Measurement: Confirmation of method by study of documentation. |
|
Definition: Size of parallel corpus. Method: Developer report of parallel corpus size. Measurement: Size can be reported in terms of bytes, sentence pairs or words per language. |
|
Definition: Accessibility of example corpus Method: Specification by developer of corpus accessibility Measurement: Yes or no: Corpus is accessible |
|
Definition: Form of examples Method: Specification by developer of example formats Measurement: Confirmation of example format specifications |
|
Definition: Number of examples Method: Counting of examples Measurement: Number of examples in corpus |
|
Definition: Source language matching technique. Method: Specification of source language matching technique and parameters. Measurement: 1) Yes or no: Is specification provided? 2) A test suite can also be used to test coverage and flexibility of the source language matching technique. |
|
Definition: Ease of extending / adding examples. Method: Test corpus. Measurement: Percentage of test items that can be added. |
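The "Size of parallel corpus" measurement above allows three units: bytes, sentence pairs, and words per language. A minimal Python sketch of such a report follows; it assumes the corpus is held in memory as (source, target) string pairs, counts words by whitespace splitting, and measures bytes in UTF-8, all of which are choices made for this example rather than requirements of the metric.

```python
from typing import Dict, List, Tuple


def corpus_size(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Report the size of a parallel corpus in the three units named by the metric."""
    return {
        "sentence_pairs": len(pairs),
        "source_words": sum(len(source.split()) for source, _ in pairs),
        "target_words": sum(len(target.split()) for _, target in pairs),
        # Byte counts depend on the encoding; UTF-8 is assumed here.
        "bytes": sum(len(source.encode("utf-8")) + len(target.encode("utf-8"))
                     for source, target in pairs),
    }


if __name__ == "__main__":
    print(corpus_size([("the house", "la maison"), ("a book", "un livre")]))
```

Reporting all three figures together makes size claims from different developers easier to compare, since word counts differ across languages and byte counts differ across encodings.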
References for this characteristic:
Somers, 2000
Comments:
See also ease of update in section 2.2.5.2.4
Definition:
A translation memory is a multilingual text archive containing multilingual texts, allowing storage and retrieval of aligned multilingual text segments against various search conditions.
Different translation memories differ as to the information stored along with the raw texts and the retrieval methods. This definition does not restrict translation memory to what is currently available in systems on the market.
A translation memory is a collection of multilingual correspondences with optional control information stored with each correspondence. This characterization abstracts away from the actual manner of storing the correspondences (one-one, one-many, or many-many).
The control information can include information about the source text of the correspondence, its date, author, company, subject domain. This information may be used in ranking matches.
When a translation memory is used to support a given direction of translation, we can identify one segment of each correspondence as the (stored) source segment and another one as the (stored) target segment. A given query with a current source segment may return a number of correspondences with matching stored source segments. (EAGLES).
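The characterization above (correspondences, optional control information, and a query that returns matching stored source segments) can be sketched as a small data structure. The Python below is one possible reading, not the EAGLES specification: the class name Correspondence, the function query, the 0.7 threshold, and the use of the standard library's difflib.SequenceMatcher as the matching technique are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


@dataclass
class Correspondence:
    source: str                      # stored source segment
    target: str                      # stored target segment
    control: Dict[str, str] = field(default_factory=dict)  # e.g. date, author, domain


def query(memory: List[Correspondence], segment: str,
          threshold: float = 0.7) -> List[Tuple[float, Correspondence]]:
    """Return correspondences whose stored source segment resembles the
    current source segment, best match first."""
    scored = []
    for entry in memory:
        score = SequenceMatcher(None, segment.lower(), entry.source.lower()).ratio()
        if score >= threshold:
            scored.append((score, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


if __name__ == "__main__":
    memory = [
        Correspondence("Press the red button.", "Appuyez sur le bouton rouge.",
                       {"domain": "manuals"}),
        Correspondence("Close the door.", "Fermez la porte.", {"domain": "manuals"}),
    ]
    for score, entry in query(memory, "Press the green button."):
        print(round(score, 2), entry.target, entry.control)
```

Control information is carried along with each match so that a ranking step could prefer, for example, correspondences from the same subject domain; the sketch ranks by string similarity only.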
Metrics:|
Definition: The developer should provide a description of the incorporation of translation memory and how it fits into the MT process. Method: Provision of supporting documentation Measurement: Yes or no: Does the documentation describe the role and function of translation memory? |
|
Definition: Size of parallel corpus Method: Developer report of parallel corpus size Measurement: Size can be reported in terms of bytes, sentence pairs or words per language |
|
Definition: Form and number of text segments Method: Specification by developer of form, granularity and number of text segments Measurement: 1) Confirmation by test or examination of form and granularity of text segments. 2) Count of number of text segments. 3) Test suites may be designed and executed in which case the measurement is percentage of test suite cases accepted. |
|
Definition: Type of control information permitted Method: Specification by developer of type of control information permitted. Measurement: Inspection of specifications. Number of specified control settings which work. |
|
Definition: Source language matching technique Method: Specification of source language matching technique and parameters. Measurement: 1) Yes or no: Is specification provided? 2) Use test suite to test coverage and flexibility of source language matching algorithm. |
|
Definition: Ease of extending parallel corpus. Method: Test corpus / test suite application Measurement: Percentage of test items that can be added |
References for this characteristic:
EAGLES Evaluation Standard for Translation Memory
Comments:
The incorporation of translation memory into traditional machine translation platforms is a relatively new and under-represented field of study, although a few examples do exist (AMTA-2002)
Definition:
Every MT system embodies some theory of language and of translation. Usually most of the theoretical assumptions are implicit, possibly not even known to the developer.
The simplest MT systems perform direct replacement of terms and phrases in the source language with target language equivalents. In addition, rudimentary word order changes may often be performed. Example-based systems (EBMT) are one type of this class; they replace phrases or even whole paragraphs at a time.
More sophisticated MT systems try to improve syntactic (grammatical) quality by analyzing the source sentence into a syntax tree and then converting the tree into the form required by the target syntax (for example, moving the verb complex). At the cost of building grammars and parsers, such syntactic transfer systems produce higher quality.
One level more complex, semantic transfer systems analyse the source text into some formalism that is intended to capture meaning, not just grammatical form. The formalisms used by shallow semantic systems are not fully language-independent and hence require some transformations into target form.
The most complex systems analyse the input into a language-neutral interlingual formalism, from which many target languages can be directly generated. No wide coverage interlingua has yet been developed.
These levels of translation have been represented by the so-called MT triangle (Vauquois).
In general, the more sophisticated the internal processing, the higher the output quality, but the more domain-specific and brittle the system. Most modern working systems include a blend of syntactic and semantic transfer.
References for this characteristic:
Comments:
In practice, MT systems do not operate solely at one level of processing. In fact, falling back to a less complex strategy is often indicated when errors occur.
Definition: The simplest MT systems perform direct replacement of terms and phrases in the source language with target language equivalents. In addition, rudimentary word order changes may often be performed.
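A toy Python sketch of the direct approach is given below: word-for-word replacement from a bilingual lexicon, preceded by a single rudimentary reordering operation. The lexicon, the word classes, and the adjective-noun swap are invented for the example and do not describe any real system.

```python
# Hypothetical English-to-French resources, assumed for illustration only.
LEXICON = {"the": "le", "black": "noir", "cat": "chat", "sleeps": "dort"}
ADJECTIVES = {"black"}
NOUNS = {"cat"}


def direct_translate(sentence: str) -> str:
    """Direct replacement with one reordering rule:
    an adjective immediately before a noun is moved after it."""
    words = sentence.lower().split()
    reordered = []
    i = 0
    while i < len(words):
        if (i + 1 < len(words)
                and words[i] in ADJECTIVES and words[i + 1] in NOUNS):
            reordered.extend([words[i + 1], words[i]])  # swap adjective and noun
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    # Words missing from the lexicon are passed through unchanged.
    return " ".join(LEXICON.get(word, word) for word in reordered)


if __name__ == "__main__":
    print(direct_translate("The black cat sleeps"))
```

Counting the entries in such a lexicon and the number of reordering rules corresponds to the "form and number of substitutions" and "number of reordering operations" metrics listed for this characteristic.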
Metrics:|
Definition: The developer should provide a description of the theory and method of translation used by the system. Method: Receipt and review of the specification of method. Measurement: Yes or no: Is the system a direct translation system? |
|
Definition: The form and number of substitutions. Method: Receipt and review of the form and number of substitutions. Measurement: 1) Yes or no: Does the description of the substitution forms exist? 2) What is the number of substitution rules? |
|
Definition: Number of reordering operations Method: Receipt and review of the reordering operations list Measurement: Yes or no: Description of reordering operations exists? |
|
Method: Receipt and review of the reordering operations list Measurement: Count number of reordering operations possible |
|
Method: Test suite application Measurement: Number of advertised reordering operations that are carried out successfully. |
|
Method: Test suite application Measurement: Number of test suite items that are processed successfully |
|
|
Comments:
Note that the test suite methods mentioned here test whether the system processes a particular rule or combination, without measuring or addressing the overall quality as defined in section 2.2.
Definition: MT systems which analyse the source text into a syntax tree and then convert the tree into the form required by the target syntax (for example, moving the verb complex) or analyse the source text into some formalism that is intended to capture meaning, not just grammatical form.
Metrics:|
Definition: The developer should provide a description of the theory and method of translation used by the system. Method: Receipt and review of the specification of the transfer method. Measurement: 1) Yes or no: Is the system a transfer translation system? 2) At what level of transfer does the system operate primarily? |
|
Definition: If the system uses a grammar, the form and number of grammatical analysis rules Method: Receipt and review of the grammatical rules in place Measurement: 1) Yes or no: Do the grammatical rules exist? 2) What number of grammatical rules exist? |
|
Definition: If the system uses a grammar, the coverage of the grammatical rules Method: (a) Receipt and review of the grammatical rules in place (b) Use test suite to test for grammatical phenomena coverage Measurement: (a) Analyzed coverage of grammatical rules for number of phenomena covered. (b) Number of test suite cases covered |
|
Definition: If the system uses a grammar, the ease of adding or changing the rules. Method: Add or change a grammar rule. Measurement: Yes or no: Can the grammar be changed? Notes: This is related to section 2.2.5.2. |
|
Definition: If the system uses a grammar, the relaxation capacity. Method: (a) Receipt and review of the grammatical relaxation algorithm. (b) Use test suite to test for grammatical relaxations. Measurement: (a) Analyzed coverage of relaxation algorithm for number of phenomena covered. (b) Number of test suite cases covered. |
|
Definition: With respect to the transfer component, the form and number of transfer rules Method: Receipt and review of the transfer rules Measurement: 1) The transfer rules are described in a standardized form. 2) Number of transfer rules |
|
Definition: With respect to the transfer component, the coverage of transfer rules. Method: (a) Receipt and review of the transfer rules. (b) Use test suite to test for transfer phenomena coverage. Measurement: (a) Analyzed coverage of transfer rules for number of phenomena covered. (b) Number of test suite cases covered. |
|
Definition: With respect to the transfer component, the relaxation capacity of the transfer rules. Method: (a) Receipt and review of the relaxation mechanism for the transfer rules. (b) Use test suite to test for transfer rule relaxation. Measurement: (a) Analyzed coverage of the relaxation mechanism in transfer. (b) Number of test suite cases covered. |
|
Definition: With respect to the transfer component, the ease of adding or changing the rules. Method: Add or change transfer rules. Measurement: Yes or no: Can transfer rules be added or changed? Notes: This is also related to section 2.2.2.5 |
Definition: The most complex MT systems analyse the input into a language-neutral interlingual formalism, from which many target languages can be directly generated.
Metrics:|
Definition: The developer should provide a description of the theory and method of translation used by the system. Method: Receipt and review of the specification of the interlingual structure and methods. Measurement: Yes or no: Is the system an interlingual translation system? Notes: If the system uses a grammar, then the grammar metrics of Transfer (2.1.1.2.2/412) apply. |
|
Definition: Expressive power of the interlingual notation. Method: Analysis of the interlingual notation scheme. Measurement: 1) Yes or no: Does the interlingual notation exist? 2) How many levels of complexity for the notation exist? |
|
Definition: Coverage of the interlingual notation. Method: (a) Receipt and review of the interlingual notation, including markup instructions. (b) Use test suite to test for phenomena coverage. Measurement: (a) Analyzed coverage of interlingual notation (using formal methods). (b) Number of test suite cases covered. |
|
Definition: Representation of standard linguistic phenomena (e.g., sentence component promotion and demotion, phrasal expression, tense/time, aspect, etc.) must be given. See, for example, Ontological Semantics (Nirenburg and Raskin 2002). Method: Use test suite for phenomena coverage. Measurement: Number of test suite cases covered. |
References for this characteristic:
Definition:
This characteristic is concerned with linguistic resources such as bilingual dictionaries (lexicons), vocabulary lists, terminology, grammars and corpora along with the utilities to enable the user to use or modify the resources as well as to add new resources. This “internal” characteristic considers the existence and availability of the resources and utilities. Questions of their usefulness, efficiency and ease of use are considered under the so-called external characteristics which are properties of the running system.
"In order to provide users with a working system adapted to their environments, many translation technology products provide add-on dictionaries in certain subject areas and languages. Linguistic resources may also include the ability to create other bilingual, multi-lingual or reversible dictionaries to provide terminology quickly in other language pairs. The ability to enter additional information to the dictionaries or terminology database is also reviewed."
In order to ensure that terminology is consistent between multiple translators working in the same target language, it is essential for the product to offer facilities whereby the terminology can be shared and re-distributed as required. The way in which the product provides multi-user access to terminology is documented, together with any utilities for generating printouts and reports of the dictionary or terminology database contents.(OVUM report).
Definition: "The range of languages which the product supports is a vital selection criterion. In machine translation systems, the languages are classified according to source and target language pairs, due to the need for full linguistic processing capability. In translator workbench products, the languages are not necessarily classified by strict language pairs as these products are interactive and therefore require only partial linguistic information. Terminology products have little or no linguistic ability and therefore the information only relates to the character sets which the product supports." (OVUM report)
Metrics:|
Definition: This metric measures which language pairs the developer or manufacturer claims to be able to treat with their product. Introduction: In acquiring any translation technology it is clearly vital to ensure that the system or tool can treat the languages required by the user. The cheapest and easiest way to ascertain this is to check the claims of the system producer. Method: Review the documentation provided by the developer/vendor of the system or tool and note the languages which are claimed to be covered for that product. If applicable, this should be done for each tool or application contained in the product. Measurement: List of the languages (or directional language pairs) claimed to be treated by the product, with separate lists for each sub-application (e.g. MT, terminology, translation memory). Additional info: This is generally a very cheap and easy way to find out which languages are covered by the system or tool. Notes: As noted above, different types of systems and tools impose different types of constraints on the languages covered. In complex systems involving both MT and translation memories and/or terminology acquisition and maintenance tools, it is important to check which languages or language pairs are supported for each of those activities. In cases where there is no documentation, or where the evaluator for some reason does not have complete confidence in the documentation, a more direct but much more resource-consuming metric such as the following one could be applied. |
|
Definition: This metric is concerned with the languages supported by the translation product. Introduction: This metric is designed to establish which languages are supported by the product by actually testing the product on different languages. It should be applied in combination with the previous metric (i.e. checking which languages the documentation claims to be supported). In fact, in many cases such inspection of the documentation should be sufficient, and this metric should really only be applied in cases where there is no documentation or it is deemed to be unreliable. Method: For each component tool of the product (MT, terminology management, translation memory, etc.), run the tool on texts or other relevant resources in a variety of languages, and record whether it was possible to treat that particular language. Measurement: List of languages (and in the case of MT, directional language pairs) which the product can treat. Additional info: For an evaluation intended to recommend a system for a specific user, only the languages which the user wants to be able to treat need to be tested. For a more general evaluation, the evaluator must find a way to decide which languages to treat. Clearly, the more languages that are tested, the greater the cost in time and resources. Although this seems to be a quite straightforward metric, care should be taken to ensure that, in cases where the product appears not to be able to treat a particular language, this is not due to a different failing of the system, such as an inability to treat a particular type of text rather than a particular language. |
|
Definition: This metric concerns whether it is possible to add new languages (or language pairs) to the MT system. Introduction: This metric provides an indication of how flexible the system is with respect to language coverage and provides the simplest form of metric. Method: Study the documentation to discover whether it is possible to add new languages or language pairs and whether this can be achieved by a user or is a task for the developer/vendor of the tool. Measurement: Can new languages be added? Can a user add a new language? Scale: yes/no. Additional info: If this quality (of being able to extend the languages covered) is important to the evaluation, it is recommended to also apply the metric for "changeability" to understand not only how new languages can be added but how feasible it is for users to add languages. In itself this metric is probably not particularly costly in terms of time or resources. However, in combination with an assessment of how to add language pairs, it becomes more costly. Notes: New language pairs could be created using the source and target of different existing language pairs, without adding an entirely new language to the mix. For instance, a system with French-English and English-German may be able to add French-German as a language pair. |
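The note about French-English and English-German yielding French-German can be made concrete. The Python sketch below lists the directional pairs that could in principle be composed through one intermediate language from the pairs a product already supports. Whether a given product can actually chain its engines in this way is a separate question that the documentation or a test must answer; the function name and the language codes are assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (source language, target language)


def pivot_pairs(existing: Iterable[Pair]) -> Set[Pair]:
    """Directional pairs reachable by chaining two existing pairs through a
    shared intermediate language, excluding pairs already supported."""
    supported = set(existing)
    derived = set()
    for source, middle in supported:
        for other_source, target in supported:
            if (middle == other_source and source != target
                    and (source, target) not in supported):
                derived.add((source, target))
    return derived


if __name__ == "__main__":
    print(pivot_pairs({("fr", "en"), ("en", "de")}))
```

Because the pairs are directional, the example derives French-German but not German-French, which would require German-English and English-French to be present as well.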
Definition: The kinds and number of dictionaries available. In this context, a dictionary is assumed to be equivalent to the term lexicon. In MT systems, the term tends to be used interchangeably. Another assumption is that the lexicon carries more information than a standard wordlist or glossary, including grammatical information. Finally, machine translation engines tend to have a general dictionary and other specialized vocabulary dictionaries, which are often called customer specific dictionaries.
Metrics:|
Definition: This metric is designed to ascertain the nature of dictionaries available with the MT engine, or which may be acquired separately Method: Examine the documentation accompanying the system and ascertain the kinds of dictionaries available or potentially available. Note whether there are domain specific dictionaries also available Measurement: List of the kinds of dictionaries available Additional info: Since this metric only involves looking at the documentation of the system it is very cheap to apply |
|
Introduction: The format of dictionary entries can help the evaluator or end user to assess how easy it would be to either import existing dictionaries or to update dictionaries Method: Examine either the documentation accompanying the system or the dictionaries included to ascertain the format used Measurement: Description of the format used. (If the dictionary uses a standard format, just the name will suffice) Notes: It may also be interesting to apply metrics listed under the characteristic "Changeability" to understand not only how dictionaries can be updated but also how easy it is to do. |
Definition: The kinds of wordlists or glossaries available. A wordlist is differentiated from a dictionary in that wordlists tend to contain only word pairs rather than the necessary grammatical information for analysis, transfer or generation. Glossaries, in this context, include phrasal lists and idiom lists. Idioms require special handling in the update case, as will be noted later.
Metrics:|
Definition: The different word lists, if any, which are provided with the system. Introduction: The wordlists which are available with the system can give the evaluator an indication of the coverage of the system. Method: The (potentially) available wordlists or glossaries are discovered by examining the documentation provided by the developer or vendor and noted in a list. Measurement: List of the available wordlists and glossaries. |
|
Definition: This metric is designed to discover the format of the wordlists or glossaries provided. Introduction: It is important to know the format of the wordlists or glossaries in order to understand how to input new entries into the system. Method: By inspecting the documentation provided by the developer or vendor, ascertain the format of the glossaries and whether they match the format of the system. Measurement: A description of the format of the wordlists/glossaries. |
Definition: The kinds and number of monolingual, comparable or parallel corpora available. The category of corpus will depend on the style of language modeling and statistical techniques used in the system.
Metrics:|
Method: Report by the developer. Measurement: List of described types of corpora - monolingual, comparable, parallel. |
|
Method: Report by the developer. Measurement: List of described numbers of corpora, categorized by type. |
|
Definition: Kinds of each type of corpora incorporated into the system. Beyond the type of corpora (monolingual, comparable, etc), there are the kinds. Kind will include domains, genre, dates of collection, etc. Method: Report by the developer. Measurement: List of described domains of corpora, categorized by type. |
Comments:
Note that the ease of update is covered in section 2.2.5.2
Definition: The types of grammars used during the translation process in a particular MT system and their nature, complexity, and coverage.
Metrics:|
Method: Specification by developer. Measurement: The type of grammar used should specify which of the formalisms (e.g., lexical functional grammar (LFG), generalized phrase structure grammar (GPSG)) it conforms to and any adaptations needed. |
|
Method: Analysis of grammar reported by developer. Measurement: Analyzed complexity in order of magnitude measures per number of input tokens. |
|
Method: (a) Analysis of grammar reported by developer (b) Test grammatical coverage using test suites Measurement: (a) Analyzed coverage reported in terms of linguistic constructs covered (b) Number and type of linguistic constructs covered in test suite |
Definition:
In addition to the operation of the system per se, other activities or processes must take place to enable successful MT operation.
Most translation technology products provide some facilities for customisation. This can range from machine translation systems that typically offer very little to some translator workbench products with many customisable features. The degree to which users can customize products to suit their own environment is a critical factor in selecting the most appropriate product.
This heading assesses user-definable features for areas such as project management and linguistic processing.
Comments:
Pre-translation is defined as modifying the translation memory without notifying the user.
Definition:
Translation preparation is related to transferring the source text into a form which the translation process can accept or which will facilitate translation.
The more the source text can be designed and created with translation in mind, the less work it will require when passing into the translation process (OVUM report).
Metrics:|
Method: 1) Determine if the system has the feature by reading the documentation. 2) Test the operation of the feature through one or more test cases. Measurement: 1) Yes or no: The feature exists. 2) The number of words successfully marked as not-translate. |
|
Method: 1) Review system documentation to see if this is a feature and how it treats these terms. For instance, in Chinese-English, if a Chinese term is marked as do not translate, is it transliterated or rendered in the native font in the output text? 2) Mark words as do not translate and run them through the system. Measurement: 1) Yes or no: Terms can be marked as do not translate. 2) Description of handling strategies for not-translated words. |
|
Method: 1) Review system documentation to see if there is a maximum supported input text length. 2) Run documents under and over the input text length to determine handling of out-of-bounds text. Measurement: 1) Length (in words or bytes) of largest input text permitted. 2) Description of error handling for texts larger than the maximum length. That is, are they split into separate texts, does the system crash, etc. |
|
Method: 1) Review system documentation to see if there is a maximum supported input sentence length. 2) Run a suite of sentences under and over the input length to determine handling of over-length text. Measurement: 1) Length (in words or bytes) of largest input sentence permitted. 2) Description of error handling for sentences larger than the maximum length. That is, are they split into separate texts, does the system crash, etc. |
|
Method: 1) Review of system architecture to determine if module does a not-found check on words before translation process begins. 2) Review of intermediate system artifacts to see if marking occurs. Measurement: 1) Yes or no: Does the system pre-scan document for not-found words? 2) If so, how does the marking occur? |
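The two input-length metrics above (running inputs under and over the documented limit) could be partly automated along the following lines. This is a sketch under stated assumptions: `translate` is a hypothetical callable wrapping the system under test, inputs are built by repeating a placeholder token, and "accepted" simply means that the call returned without raising an exception.

```python
def probe_length_limit(translate, lengths, token="word"):
    """Run inputs of the given lengths (in words) through the system.

    translate: hypothetical callable wrapping the MT system under test.
    Returns (largest_accepted_length_or_None, {length: outcome}).
    """
    outcomes = {}
    for n in lengths:
        text = " ".join([token] * n)
        try:
            translate(text)
            outcomes[n] = "ok"
        except Exception as exc:  # record the kind of failure observed
            outcomes[n] = f"error: {type(exc).__name__}"
    accepted = [n for n, outcome in outcomes.items() if outcome == "ok"]
    return (max(accepted) if accepted else None), outcomes
```

A harness like this only sees errors that are reported back to the caller. Whether an over-length input is silently split or truncated still has to be established by inspecting the output, as the measurement descriptions require.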
References for this characteristic:
Trial of the Weidner System, 1985.
Jordan, Benoit and Dorr, 1993.
Comments:
This is more a part of the process flow as opposed to the translation process per se.
Definition:
Post-translation activities relate to preparing the output texts to meet the requirements for final publication or delivery (OVUM report).
Revision of output translation interactively to produce a final version for printing (Trial of the Weidner Computer-Assisted Translation System, p.12, October 1985). Sometimes this is referred to as the camera-ready copy.
Metrics:|
Definition: Availability of editing functions in system without retranslating (JEIDA report). Method: 1) Consult the system documentation to verify availability and operation of post-edit functions. 2) Test the operation of each function on test documents. Measurement: Description of the functions available with their operation parameters. |
Comments:
Traditionally, this has been the stage of the process requiring the most time for production-quality translations.
The designation of post-translation processing is often part of management control.
The amount of post-processing necessary can be used to assess the accuracy of the translation component (see the relevant characteristics under Accuracy). The time taken to post-process texts is dealt with under the Time Behaviour characteristic.
Definition:
Interactive MT systems require user guidance at points when the system reaches an impasse during processing. The user's assistance (whether in the form of menu choices or parameter entry) constitutes a form of editing that can be called "inline editing" or "in-editing" (The Pangloss Mark III MT System).
Metrics:|
Method: Count number of times system requires assistance when translating a test corpus. Measurement: Number of steps needed or number of steps as percentage of test corpus size. |
|
Method: Measure the amount of time it takes to perform interactive translation on test corpus. Measurement: Amount of time for interactive translation on test corpus. Notes: This metric closely resembles (in intention at least) Input to Output translation speed listed under "Efficiency: Time Behaviour" below. |
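The first measurement above (number of assistance steps as a percentage of test corpus size) amounts to a simple ratio. The sketch below assumes the corpus size is counted in the same unit throughout (for example sentences); the choice of unit and the function name are illustrative assumptions.

```python
def interaction_rate(num_prompts, corpus_size):
    """Assistance requests as a percentage of test corpus size.

    num_prompts: number of times the system asked the user for help.
    corpus_size: size of the test corpus, in whatever unit the evaluator
                 has chosen (sentences are assumed here).
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return 100.0 * num_prompts / corpus_size
```

For example, 12 prompts over a 200-sentence corpus gives `interaction_rate(12, 200)`, that is, 6 prompts per 100 sentences.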
Comments:
This quality will not be appropriate for certain classes of MT process, such as embedded MT.
Definition: Facilities to assist users in researching and entering terminology which the machine does not recognize into the system's dictionary.
Metrics:|
Method: 1) Read system documentation to determine if the system produces a separate not-translated word list. 2) Establish structure of list, if it exists. 3) Run test suite through system and examine not-translated word list. Measurement: 1) Yes or no: Not-translated word list produced. 2) Description of format of not-translated word list, to include transliteration conventions and system-specific markings. |
|
Definition: Ease of identifying source terms / their target language equivalents and grammatical information. Method: Check not-translated word list to see if contextual information is provided with the not-translated words. Measurement: Context is/is not sufficient for update. |
Comments:
The stage of dictionary update consists of entering in the machine's dictionary words or expressions the translator considers useful for future translations. It is recommended that only terms likely to be used in 20% of future texts be inserted (Trial of the Weidner System, 1985, p.12).
For metrics related to ease of dictionary update, see Changeability (2.2.5.2./213).
Definition: Degree to which the output respects the reference rules of the target language at the specified linguistic level.
Metrics:|
|
|
|
|
|
References for this characteristic:
Flanagan, 1994. (See also the LOGOS error list in the same AMTA proceedings).
Loffler-Laurian, 1983 (in French).
See also Arnold et al, eds., 1993 ('Machine Translation' 1993 vol. 8:1-2, special issue on evaluation).
Comments: We include here only the four most critical categories of error typically made by MT systems, though very often a more detailed classification is used. For example, SYSTRAN uses at least seven types of errors to rank the quality of the output: segmentation / tokenization, morphological analysis, homograph analysis, syntactic analysis, target language word selection, target language morphology, target language word order, target language grammar. All these errors are rated for severity. The severity ranges from "cosmetic" to "serious" when, for example, the meaning of the original word/phrase is completely lost (L. Gerber, personal communication).
Definition: Degree to which the output respects the reference (inflectional) morphological rules of the target language.
Metrics:|
|
|
|
|
|
References for this characteristic: Miller and Vanni 2000
Comments: Inflections typically carry information about number, gender, case, tense, aspect, etc. This quality is especially important for highly inflected languages.
Definition: Degree to which the output respects the reference punctuation rules of the target language.
Metrics:|
|
References for this characteristic: Nunberg 1990.
Comments: A distinction may be made between punctuation errors which affect the meaning of the text and those which do not (Balkan 1991).
Definition: Degree to which the output respects the reference semantic co-occurrence restrictions of the target language.
Metrics:|
|
References for this characteristic:
Comments: Lexical errors arising from words or phrases that are inappropriate (in their collocations, connotations, register or idiomaticity), too general or too specific are also called "diction errors" in (Balkan 1991).
Definition: Degree to which the output respects the reference grammatical rules of the target language.
Metrics:|
Method: 5-point scale of syntactic correctness. Notes: ALPAC |
|
Method: 5-point scale of syntactic correctness. |
|
|
|
|
|
|
References for this characteristic:
Flanagan, 1994. (See also the LOGOS error list in the same AMTA proceedings).
Loffler-Laurian, 1983 (in French).
See also Arnold et al, eds., 1993 ('Machine Translation' 1993 vol. 8:1-2, special issue on evaluation).
Definition: The capability of the software product to interact with one or more specified systems.
References for this characteristic: ISO 9126: 2001, 6.1.3.
Comments: "Interoperability" is used in place of "compatibility" in order to avoid possible ambiguity with "replaceability".
Definition: The capability of the software product to adhere to standards, conventions or regulations in laws and similar prescriptions relating to functionality.
References for this characteristic: ISO 9126: 2001, 6.1.5.
Comments: In the context of machine translation systems, a number of de facto standards may be relevant here, for example standards for the open lexicon interchange format (OLIF) or standards for terminology exchange (MARTIF).
Definition: The capability of the software product to protect information and data so that unauthorized persons or systems cannot read or modify them and authorized persons or systems are not denied access to them.
References for this characteristic: ISO 9126: 2001, 6.1.4.
Comments: Safety is defined as a characteristic of quality in use, as it does not relate to software alone but to a whole system (this is note 2 from ISO).
Definition: The capability of the software product to maintain a specified level of performance when used under specified conditions (ISO 9126: 2001, 6.2).
Definition: The capability of the software product to avoid failure as a result of faults in the software.
References for this characteristic: ISO 9126: 2001, 6.2.1.
Definition: The capability of the software product to maintain a specified level of performance in cases of software faults or of infringement of its specified interface.
Metrics:|
Definition: Input tolerance for typing / conversion and other errors Method: Design and execute characteristic error test suites. Measurement: Percentage of ill-formed inputs successfully handled by the system. |
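The measurement above can be computed by running an error test suite through the system and counting the inputs it copes with. The following is a minimal sketch under explicit assumptions: `translate` is a hypothetical callable wrapping the system, and an input counts as "successfully handled" when the call raises no exception and returns non-empty output. A real evaluation would normally apply a stricter, task-specific criterion.

```python
def tolerance_rate(translate, ill_formed_inputs):
    """Percentage of ill-formed inputs the system handles.

    translate: hypothetical callable wrapping the MT system under test.
    "Handled" is assumed to mean: no exception and non-empty output.
    """
    inputs = list(ill_formed_inputs)
    if not inputs:
        raise ValueError("the error test suite is empty")
    handled = 0
    for text in inputs:
        try:
            if translate(text):
                handled += 1
        except Exception:
            pass  # a raised error counts as not handled
    return 100.0 * handled / len(inputs)
```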
References for this characteristic: ISO 9126: 2001, 6.2.2.
Definition: The number of the system crashes per unit of time.
Definition: The capability of the software product to re-establish a specified level of performance and recover the data directly affected in the case of a failure.
References for this characteristic: ISO 9126: 2001, 6.2.3.
Definition: The capability of the software product to adhere to standards, conventions or regulations relating to reliability.
References for this characteristic: ISO 9126: 2001, 6.2.4.
Definition: The capability of the software product to be understood, learned, used and attractive to the user, when used under specified conditions.
References for this characteristic: ISO 9126: 2001, 6.3.
Definition: The capability of the software product to enable the user to understand whether the software is suitable, and how it can be used for particular tasks and conditions of use.
References for this characteristic: ISO 9126: 2001, 6.3.1.
Comments: This will depend on the documentation and initial impressions given by the software.
Definition: The capability of the software product to enable the user to learn its application.
References for this characteristic: ISO 9126: 2001, 6.3.2.
Definition: The capability of the software product to enable the user to operate and control it.
References for this characteristic: ISO 9126: 2001, 6.3.3.
Comments:
(Note 1 from ISO) Aspects of suitability, changeability, adaptability and installability may affect operability.
(Note 2 from ISO) Operability corresponds to controllability, error tolerance and conformity with user expectations as defined in ISO 9241-10.
(Note 3 from ISO) For a system which is operated by a user, the combination of functionality, reliability, usability and efficiency can be measured externally by quality in use.
Definition: At the project-management level, customisable features include setting the level of access, setting up directories and file preparation and obtaining customized printouts (for example, of statistics).
Metrics:|
Method: 1) Read documentation to determine if feature exists. 2) Set multiple layers of data access. Measurement: 1) Yes or no: Does the feature exist? 2) Description of the layers of data access. |
|
Method: 1) Read documentation to determine if feature exists. 2) Set up directory structure and see if system accesses it properly. Measurement: 1) Yes or no: Can the directory structure be set up? 2) Yes or no: The system accesses it properly. |
|
Method: 1) Read documentation to determine if feature exists. 2) Prepare and track files within the framework. Measurement: 1) Yes or no: File preparation and tracking exists. 2) Yes or no: Documents can be prepared and tracked within framework. |
|
Method: 1) Read documentation to determine if feature exists. 2) Prepare and print customized statistics, such as usage. Measurement: Yes or no: The feature exists. |
Definition: TBD
Definition: The capability of the software product to be attractive to the user.
References for this characteristic: ISO 9126: 2001, 6.3.4.
Comments: This refers to attributes of the software intended to make the software more attractive to the user, such as the use of colour and the nature of the graphical design.
Definition: The capability of the software product to adhere to standards, conventions, style guides or regulations relating to usability.
References for this characteristic: ISO 9126: 2001, 6.3.5.
Definition: The capability of the software product to provide appropriate performance, relative to the amount of resources used, under stated conditions.
References for this characteristic: ISO 9126: 2001, 6.4.
Comments:
(Note 1 from ISO) Resources may include other software products, the software and hardware configuration of the system, and materials (e.g., print paper, diskettes).
For different types of translation work the importance of this characteristic is different (see (1.3) Characteristics of the translation task (112)).
Definition: The capability of the software product to provide appropriate response and processing time and throughput rates when performing its function under stated conditions.
References for this characteristic: ISO 9126: 2001, 6.4.1.
Comments:
This characteristic is divided into a number of different sub-characteristics whose relevance depends on whether you wish to evaluate the time behaviour for the entire translation process or to consider parts of the process (such as pre- or post-editing) separately from the time behaviour of the machine translation engine itself.
Definition: This characteristic concerns the time between the request for a translation and reception of the final translation.
Metrics:|
Definition: This metric would be designed to evaluate how long it typically takes for a translation job to be completed, from the time it is first commissioned until the finished product is delivered. This of course depends crucially on the quality of translation required and the users' particular set-up. Such a metric could only be applied in specific contexts of use and would require considerable reflection to ensure its validity. References for metrics: Van Slype quotes two sources for metrics designed to evaluate production time: B.H. Dostert (1973) and Z.L. Pankowicz (1978). |
Definition: This characteristic addresses the time required for pre-processing activities with regard to using an MT system. Pre-processing can be divided into three main types of activity:
1. Pre-editing. This can include formatting as well as style, grammar or spelling corrections. This sort of pre-processing is usually only performed by MT users who also have control over the source text.
2. Code-set conversion. This can include converting the text into a format which the MT system can process, either in terms of the character sets used or the document format.
3. System preparation. This refers to cases where the user will first analyse the document(s) to be translated and then if necessary add unknown lexical items (or grammar/translation rules) to the system, before doing the translation.
Metrics:|
Method: 1) Create a test suite of representative texts. 2) Assemble necessary software modules if not incorporated into system. 3) Enable necessary software modules if incorporated into system. 4) Measure time required for pre-processing stages for one or more test texts. Measurement: Amount of time required for pre-processing. Notes: This metric is rather general and can be applied either to pre-processing as a whole or be more specifically applied to one particular activity such as pre-editing. |
Definition: This characteristic concerns the amount of time it typically takes the system to carry out the whole translation process including any pre-processing which the system might perform automatically.
Metrics:|
Definition: This metric is designed to evaluate how long it takes the system to complete a translation. Introduction: The purpose of this metric is to try to predict how the system will perform with respect to speed when it is deployed and applied to specific user tasks. Method: 1. Collect a representative sample of source texts to be translated. 2. Record the amount of text in the sample. 3. Use the system to translate the texts and record how long the translation takes. Measurement: Number of words translated per hour. Notes: The measure as defined above is in terms of words per hour. However, it is perfectly possible to measure in terms of pages and days or seconds. The evaluator is advised to apply a measure which as far as possible reflects the user's normal method of calculating translation throughput. In general, the larger the sample used in the experiment, the more accurately it reflects the time-related performance of the system. However, in designing the evaluation there is a trade-off between such accuracy and the resources available to carry out such experiments. |
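The three method steps and the words-per-hour measurement can be sketched as follows. This is an illustration only: `translate` is a hypothetical callable wrapping the system under test, and words are counted by splitting on whitespace, which is an assumption that will not suit every source language.

```python
import time


def words_per_hour(word_count, elapsed_seconds):
    """Convert a word count and an elapsed time into words per hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return word_count * 3600.0 / elapsed_seconds


def time_translation(translate, texts):
    """Translate every text and return (total_words, elapsed_seconds).

    translate: hypothetical callable wrapping the MT system under test.
    Words are counted by whitespace splitting (an assumption).
    """
    total_words = sum(len(text.split()) for text in texts)
    start = time.perf_counter()
    for text in texts:
        translate(text)
    elapsed = time.perf_counter() - start
    return total_words, elapsed
```

For instance, a 500-word sample translated in 90 seconds corresponds to `words_per_hour(500, 90)`, that is, 20,000 words per hour. Other units (pages per day, words per second) follow by changing the conversion factor.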
Definition: This characteristic addresses the time required for post-processing activities which occur after the MT system has been run. Post-processing can be divided into three main types of activity:
1. Post-editing.
2. Code-set conversion. This generally applies when the MT system uses different codesets than those used in the source text and/or by the commissioner of the translation. It can therefore be necessary to (re-)convert the codesets in the output text to match, e.g., the required character sets or document formats.
3. Update.
|
Definition: Correction rate, defined as the amount of time required to correct a text after the translation. Method: Time the correction of representative texts. Measurement: Time to correct. |
Definition: The capability of the software product to provide appropriate response and processing times and throughput rates when performing its function under stated conditions (ISO 9126: 2001, 6.4.2).
Definition: TBD
Definition: TBD
Definition: TBD
Definition: The capability of the software product to be modified. Modifications may include corrections, improvements or adaptation of the software to changes in environment and in requirements and functional specifications. (ISO 9126: 2001, 6.5).
Definition: The capability of the software product to be diagnosed for deficiencies or causes of failures in the software, or for the parts to be modified to be identified. (ISO 9126: 2001, 6.5.1)
Definition: The capability of the software product to enable a specified modification to be implemented.
References for this characteristic: ISO 9126: 2001, 6.5.2.
Comments:
Note 1 from ISO: Implementation includes coding, designing and documenting changes.
Note 2 from ISO: If the software is to be modified by the end user, changeability may affect operability.
Definition: TBD
Definition: This characteristic pertains to whether the system actually improves as a result of changes made to it.
Comments: This is demonstrated by using training sets and test sets of test materials (see Coverage [2.2.1.1.2.1/504]) to show incremental improvement without degradation.
Definition:
Facility of modifying the dictionary used by an MT system, most often regarding the addition of new words, phrases, grammatical roles, or senses.
Metrics:|
Definition: The average time needed to perform the most common dictionary update operations. Method: Define a list of dictionary update operations, e.g. insertion of various types of words, etc. Use a pool of typical users of the MT system and ask them to perform the task. Compute the average time, and note cases in which the update could not be achieved (either by some subjects, or by any subject because of functional impossibilities). Scale: Absolute scales do not seem relevant here, apart from the proportion of updates that could not be achieved. However, different update methods can be compared using this metric. |
|
Definition: The average cognitive effort needed to perform the most common dictionary update operations. Method: Define a list of dictionary update operations, e.g. insertion of various types of words, etc. Use a pool of typical users of the MT system and ask them to perform the task. Using a questionnaire, estimate the cognitive effort required of the subjects. Scale: Absolute scales do not seem relevant here, apart from the proportion of updates that could not be achieved. |
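For the timing metric above, the average update time and the proportion of updates that could not be achieved might be summarised as in this sketch. The record format (one `(operation, seconds)` pair per attempt, with `None` marking a failed attempt) and the function name are assumptions made for illustration.

```python
from statistics import mean


def summarize_update_trials(trials):
    """Summarise dictionary-update attempts by test subjects.

    trials: list of (operation_name, seconds) pairs; seconds is None when
            the subject could not complete the update (assumed convention).
    Returns the mean time over successful attempts and the proportion of
    attempts that failed.
    """
    if not trials:
        raise ValueError("no trials recorded")
    times = [seconds for _, seconds in trials if seconds is not None]
    failed = sum(1 for _, seconds in trials if seconds is None)
    return {
        "mean_seconds": mean(times) if times else None,
        "failure_proportion": failed / len(trials),
    }
```

Because absolute scales are of limited use here, the output is best used to compare two update methods measured on the same list of operations and the same pool of users.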
Comments:
The implementation and availability of dictionary updating functions varies considerably with the translation model used by the system.
Definition: TBD
Definition: As part of translation preparation activities, the user may need to import different types of data (for an example from word processors, see OVUM report).
Metrics:|
Definition: Ease of importing data into the system (for an example from word processors, see OVUM report) Method: 1) Review system documentation for list of data types accepted by the system, to include file types, code sets, data formats. 2) Import data into the system for each file type, code set and data format advertised. Measurement: 1) List of file types, code sets and data formats supported by the system. 2) List of file types, code sets and data formats successfully loaded into the system. |
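The two measurements above are lists that are naturally compared with each other: what the documentation advertises versus what actually loads. A minimal sketch, in which the format labels and the function name are hypothetical:

```python
def import_report(advertised, loaded):
    """Compare advertised formats with those that loaded successfully.

    advertised: file types, code sets or data formats listed in the
                documentation (labels are whatever the evaluator uses).
    loaded:     the subset that was actually imported without error.
    """
    advertised_set = set(advertised)
    loaded_set = set(loaded)
    return {
        "supported": sorted(advertised_set),
        "loaded": sorted(advertised_set & loaded_set),
        "failed": sorted(advertised_set - loaded_set),
    }
```

The `failed` entry lists formats that are advertised but could not be imported in the test, which is the gap an evaluator would want to report.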
Definition: The capability of the software product to avoid unexpected effects from modifications of the software (ISO 9126: 2001, 6.5.3).
In the particular case of MT systems, this refers to ensuring that improvement in one area does not result in degradation elsewhere.
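One way to check that an improvement in one area has not caused degradation elsewhere is to run the same test items before and after a change and compare the outcomes item by item. The sketch below assumes each test item has a pass/fail result keyed by an identifier; the data layout and function name are illustrative assumptions.

```python
def compare_versions(before, after):
    """Compare per-item test outcomes of two system versions.

    before, after: dicts mapping test-item id -> passed (bool).
    Only items present in both runs are compared.
    Returns (improved_ids, regressed_ids), each sorted.
    """
    common = before.keys() & after.keys()
    improved = sorted(t for t in common if not before[t] and after[t])
    regressed = sorted(t for t in common if before[t] and not after[t])
    return improved, regressed
```

A change is stable in the sense described above when `regressed` is empty, regardless of how many items appear in `improved`.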
Definition: The capability of the software product to enable modified software to be validated (ISO 9126: 2001, 6.5.4).
Definition: The capability of the software product to adhere to standards or conventions relating to maintainability. (ISO 9126: 2001, 6.5.5).
Definition: The capability of the software product to be transferred from one environment to another.
Comments: Note from ISO: the environment may include organisational, hardware, or software environment (ISO 9126: 2001, 6.6).
Definition: The capability of the software product to be adapted for different specified environments without applying actions or means other than those provided for this purpose for the software considered.
Comments: Note 2 from ISO: If the software is to be adapted by the end user, adaptability corresponds to suitability for individualisation as defined in ISO 9241-10, and may affect operability. (ISO 9126: 2001, 6.6.1).
Definition: The capability of the software product to be installed in a specified environment.
Comments: Note from ISO: if the software is to be installed by an end user, installability can affect the resulting suitability and operability (ISO 9126: 2001, 6.6.2).
Definition: The capability of the software product to adhere to standards or conventions relating to portability (ISO 9126: 2001, 6.6.5).
Definition: The capability of the software product to be used in place of another specified software product for the same purpose in the same environment (ISO 9126: 2001, 6.6.4).
Comments:
Note 1 from ISO: For example, the replaceability of a new version of a software product is important to the user when upgrading.
Definition: The capability of the software product to co-exist with other independent software in a common environment sharing common resources. (ISO 9126: 2001, 6.6.3).
Definition: Cost here covers all of the monetary costs of introducing MT, maintenance costs implied by operational use of the system and the potential costs of not introducing MT.
Comments: This is not an ISO defined sub-characteristic, since it would normally play a part only in the management decisions to be made on the basis of a finished quality evaluation. However, in the MT context, cost may play a major role in disbarring a system from detailed evaluation. It is therefore included here as part of the quality model.
Definition: TBD
Definition: TBD
Definition: TBD