We open this section with a consideration of some past evaluations in the light of the notions introduced in chapter The Framework Model. No attempt is made at producing an exhaustive review: for that the reader is referred to (Falkedal94a) and to (Galliers93). The first gives a critical review of a number of machine translation evaluations, the second is concerned with natural language processing in general, using many insights from the field of information retrieval which has a mature and well-established evaluation methodology. Both have valuable extensive bibliographies.
The intention behind the choice of example evaluations here is to illustrate the variety of evaluation scenarios -- no judgement should be inferred as to the quality of the evaluation itself.
The Automatic Language Processing Advisory Committee's (ALPAC) evaluation of machine translation was one of the first evaluations (ALPAC66). For a fuller account, see (Falkedal94a) and (Hutchins86).
In terms of the framework set out in chapter The Framework Model the evaluation as a whole can be thought of as an adequacy evaluation, comparing machine translation to human translation on the three dimensions of speed, cost and quality. Here we shall focus on one part of the evaluation, Carroll's quality assessment experiment. This part can also be thought of as a progress evaluation, assessing how close the system was to the goal of producing translations comparable in quality to human translation.
The measure used was to take a set of translations, some produced by machine, some by human translators, and ask a group of test persons to rate the translations on two scales, one for intelligibility and one for fidelity (defined in terms of informativeness -- see quotation). The use of rating scales subsequently became widespread in evaluation of machine translation (see, for example, (Nagao88); but also see (JEIDA92) for a critical assessment and some more recent proposals).
The method was carefully designed. The test material was 144 sentences randomly selected from four different passages of a Russian book. Six different translations were produced for the 144 sentences, three by human translators and three by different machine translation systems. The translations were then merged randomly into six sets, with the constraint that each sentence appeared in only one translation in each set. Each set was then given to three monolingual and three bilingual test persons, all of whom had had one hour's training using a set of thirty sentences drawn from the same material as the test set. There were thirty-six test persons in total.
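The merging constraint amounts to a Latin-square-style design: each of the six rating sets contains exactly one of the six versions of every sentence, and across the six sets all six versions are used. The ALPAC report does not publish the exact procedure used, so the following is only an illustrative sketch of one way of satisfying the constraint:

```python
import random

def build_rating_sets(n_sentences=144, n_versions=6, seed=0):
    """Assign translation versions to rating sets so that each set
    contains exactly one version of every sentence, and each version
    of a sentence appears in exactly one set (a Latin-square-style
    design; the actual ALPAC procedure is not documented)."""
    rng = random.Random(seed)
    sets = [[] for _ in range(n_versions)]
    for sentence in range(n_sentences):
        versions = list(range(n_versions))
        rng.shuffle(versions)  # random version-to-set assignment per sentence
        for set_index, version in enumerate(versions):
            sets[set_index].append((sentence, version))
    return sets

sets = build_rating_sets()
# six sets, each holding one version of all 144 sentences
```

Randomising the assignment per sentence prevents any rating set from consisting mostly of output from a single translator or system, which would confound rater and translation effects.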
A definition in English was given for each of the points on each of the rating scales. Thus, intelligibility was rated on a nine point scale from ``perfectly clear and intelligible'' to ``hopelessly unintelligible''. Fidelity, somewhat counter-intuitively, was defined over a ten point scale in terms of informativeness. The following is Carroll's own definition of the scale (ALPAC66, Appendix 10, pages 67-68):
This pertains to how informative the original version is perceived to be after the translation has been seen and studied. If the translation already conveys a great deal of information, it may be that the original can be said to be low in informativeness relative to the translation being evaluated. But if the translation conveys only a certain amount of information, it may be that the original conveys a great deal more, in which case the original is high in informativeness relative to the translation being evaluated.
That this definition is somewhat counter-intuitive can be deduced from how frequently it is inaccurately reported in the literature on evaluation.
The committee reached extremely negative conclusions about what could be hoped for from machine translation systems in the short to medium term. Although the report as a whole has become ill-famed and provoked much controversy both about the possible bias of the committee's membership and about the validity of the evaluation itself, the ALPAC evaluation must be considered a pioneering effort if only because it emphasised the importance of good evaluation methodologies. In our terms, though, it is clear that the measures used involved only judgements and that the perverse definition of the fidelity scale must cast some doubt on their validity.
There have been many more recent machine translation evaluations, most frequently adequacy evaluations carried out on behalf of a potential customer. In adequacy evaluation, a great deal of effort is typically required to determine what the potential customer's needs really are. For example, a customer will normally intend to use the system to translate only certain kinds of text. This need is often reflected in the use of one or more test corpora submitted for translation; in the most common case, a potential customer who decides to use a test corpus will have to construct one reflecting his specific needs.
Before leaving the topic of adequacy evaluation, it is worth making a point which is valid for commercial natural language processing systems in general. Except at the lower end of the range, where relatively modest products such as spelling checkers can aim at relatively exhaustive coverage of the language dealt with, it is rare to find a product which will do all and only what the customer wants. Frequently, the system will have to be modified or extended to meet specific needs. Thus, evaluation is aimed at finding out not only what the system currently does but also how easily it can be modified.
Our next example comes again from machine translation. From 1992 onwards ARPA sponsored a series of evaluations of machine translation systems. The report here is based on (OConnell94). The implied needs in the ALPAC evaluations were those of someone responsible for producing translations (speed, cost, quality). In the ARPA case, the needs are those of the funding agency. The declared aim of the research programme is to ``further the core technology''. The funding agency therefore needs a comparative evaluation of systems based on different technologies and translating from different languages into English. The difficulty of the task was further compounded by an invitation to operational systems (commercial or otherwise) from outside the research programme to participate in the evaluation exercise. Furthermore, there were great differences in the way the systems were intended to be used. At one extreme, one system was planned as a fully-automatic batch-oriented system; at the other was a system intended more as an on-line aid to a human translator than as a translation system.
Given all these constraints, the only quality characteristic which offers any hope of comparability is functionality, and that only if it is interpreted in the widest sense to allow the output of a machine-aided human translation to be compared with the output of a fully automatic machine translation. Two attributes of functionality were picked out: comprehensibility of the output and quality of the output.
In an attempt to produce direct comparability across systems translating from different languages, the test materials in the 1992 evaluation were constructed by taking a set of English newspaper articles about financial mergers and acquisitions, and having them professionally translated into the source languages the different systems worked from. The systems then translated the translations back into English and it was these outputs that were evaluated.
The metric associated with the comprehensibility attribute took the form of a comprehension test, where monolingual speakers of English were presented with the outputs, with output from control processes and with the original English, as well as with a set of multiple choice questions on the content of the articles.
A first problem with the validity of this metric appeared almost at once. The wide range of competence over the same task that humans can display became one of the major issues in the ARPA series of evaluations. This surfaced in the preparation of the test material. To quote:
However, it is evident that any human manipulation, even professional translation, has too great a potential of modifying the content of a text, and thus there is no way to tell whether a particular result reflects the performance of a system or the competence of the original translation from English (OConnell94).
In subsequent evaluations, although the comprehension evaluation was retained as a ``valuable measure of the informativeness preserved by an MT system as it translates an original foreign language text into English'', the back-translation method of preparing test materials was dropped.
The metric associated with the quality attribute was based on a standard US Government metric used to grade the proficiency of human translators. A panel of professional, native speaking translators of the relevant languages was asked to carry out the grading. This metric proved to be neither valid nor reliable. The grading limits of the original metric had to be changed to take account of the nature and the proliferation of errors in the machine translation output, and it proved exceedingly difficult for the quality panel to reach a consensus. The metric was dropped in subsequent evaluations.
The three subsequent ARPA evaluations have retained the comprehensibility attribute but have replaced the quality attribute with two sub-attributes, adequacy and fluency.
The metric associated with the adequacy attribute requires literate, monolingual speakers of English to make judgements determining the degree to which the information in a professional translation can also be found in a corresponding machine translation output or a control text of the same length. The information units are fragments, delimited by syntactic constituent, and containing enough information to permit comparability.
The measure of fluency is to ask the same set of persons to determine, on a sentence by sentence basis, whether the translation reads like good English. This is done without reference to a correct translation, so the accuracy of the content does not influence the judgement.
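As described, the fluency measure reduces to the proportion of sentence-level verdicts in the translation's favour, pooled over judges. A minimal sketch, in which the boolean representation of a verdict is a simplification of the actual elicitation:

```python
from statistics import mean

def fluency(judgements):
    """judgements[judge][sentence] is True when that judge found the
    sentence to read like good English (boolean verdicts are an
    illustrative simplification of the ARPA elicitation).
    Returns the overall proportion of fluent verdicts."""
    verdicts = [v for per_judge in judgements for v in per_judge]
    return mean(1.0 if v else 0.0 for v in verdicts)

# two judges over three sentences: three fluent verdicts out of six
score = fluency([[True, True, False], [True, False, False]])  # 0.5
```

Pooling all verdicts treats every judge-sentence pair equally; averaging per judge first would instead weight judges equally regardless of how many sentences each rated.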
All three of these metrics are vulnerable to criticism. A comprehension test is typically used to test a human being's intelligence. How much does the test persons' intelligence interfere with the validity of the comprehensibility metric? Along similar lines, the adequacy metric is based on a comparison between the information contained in a professional translation and that contained in a machine output. To what degree does the competence of the human translator responsible for the professional translation interfere with the validity of this metric? The fluency metric relies entirely on subjective judgements: anyone who has ever produced a linguistic example for others to comment on knows to what extent human judgements of well-formedness and fluency can differ.
However, the particular circumstances of the ARPA evaluations give rise to problems, specific to that context but sufficiently grave as to draw attention away from the more minor problems. Briefly stated, human involvement in the production of the output to be evaluated poses great problems. For the systems involved in the ARPA evaluations, human involvement covered all of post-editing of automatic machine translation, query-based interaction with a human during the translation and actual composition of the translation by a human with aid from the machine translation system. As might be expected, the competence of the human as well as his familiarity with the tools he is using greatly affect the quality of the output and the speed with which it is obtained. Disentangling the contribution of the human and the contribution of the system has proved an impossible task.
Although the severity of this problem is an artefact of ARPA's desire to compare radically different types of systems, the general point remains: ensuring that a metric measures what it is supposed to measure and only that is both critical, and, except in the case of metrics which can be completely automated, very difficult to achieve.
Neither ALPAC nor the ARPA evaluations we have discussed have come up with a very satisfactory way of evaluating machine translation systems. It is legitimate to ask why, and also to ask whether another approach might be more promising.
The answer to why, we think, has to do with the elusive notion of quality, and with the nature of translation itself. The ARPA MT evaluations do not stand alone. There is a strong ARPA tradition of comparative evaluations: the ATIS (Boisen92) evaluations concerned data base query systems with a strong emphasis on spoken language; the TREC (Harman) evaluations concern text retrieval; and the MUC (MUC391) evaluations fact extraction. Although all of these, and the MUC evaluations in particular, have stimulated discussion about evaluation techniques and methodologies, none has aroused quite the strong sense of unease that the MT evaluations produce. We suspect that this is because, in the other cases, it is possible, in one way or another, to pre-define what counts as the correct answer to the problem the system is trying to solve and to evaluate a system in terms of its capacity to reproduce that answer. In the case of text retrieval, for example, it is possible to specify which texts out of a set of texts are relevant to a particular request and to evaluate a system in terms of its ability to identify automatically all and only those texts in response to the same request. This means that it is possible to define the attributes of the functionality quality characteristic quite precisely and to create and validate appropriate metrics. People may then discuss the definition of the attributes or argue about whether a different metric might not be superior in some way, but the definitions themselves are clear and unambiguous.
In the case of translation, it is impossible even to imagine making the same move. There is no such thing as the correct translation, and one cannot imagine artificially constructing the right answer which a machine translation system should arrive at, any more than one can imagine grading human translations by comparing them with some single perfect translation. Thus, the functionality quality characteristic cannot be given a definition in terms of a set of outputs which are the only acceptable translations of a given set of source text inputs: translation quality cannot be defined in the abstract, and therefore no more can the quality requirements of translation software be defined in the abstract.
The way out of this dilemma, we believe, is to think again about the ``stated or implied needs'' input to the quality requirements definition, and see them as the needs of the users not of the translation software, but of the translations produced. Translations are used for many different purposes, ranging from gleaning enough of the content of the original to know whether to put it into the waste-paper basket or not, to establishing legislation or attempting to convey the essence of a great work of literature. If we drop the idea of trying to define some abstract notion of translation quality and set about trying to find ways of measuring whether a translation is good enough for some specific purpose, this may prove more fruitful.
Space constraints prevent any fuller discussion of the wide variety of evaluation scenarios and techniques reported in the literature on machine translation. (Falkedal94) and (MT94) are recent collections of papers where many of them can be found.
Data base query is another application of natural language processing with a long history of evaluation. (Woods73) describes informal field testing of the LUNAR system through monitoring the treatment of 110 queries during demonstration of the system and (Damerau80) reports more extensive field testing of TQA, a transformational grammar based front end linked to a pre-existing data base of town planning data, over a period of two years from late 1977 through 1979. Both of these were clearly adequacy evaluations, with the interesting characteristic of being executed in close collaboration with the end-user community. The emphasis on field testing of data base query systems is reflected also in recent work (Jarke85; Whittaker89).
In this section, though, we shall concentrate on a proposal made in the context of progress and diagnostic evaluation by a group at Hewlett Packard (Flickinger87). They argue that although no evaluation tool could be developed for use with natural language processing systems in general, it should be possible and useful to develop a methodology for a single application domain (data base query) in a context where there are common assumptions.
The main quality characteristic considered relevant for evaluation of a generic system (i.e. a system not specifically tailored for use with one particular data base) is the functionality of the system. The relevant attributes are linguistic and computational: the system should be able to treat a wide range of linguistic phenomena and should be able to generate the correct data base query from the natural language input.
A test suite was constructed to provide data for various measures relevant to these attributes. The test suite consists of a large number of English sentences, each annotated with its construction type. The sentences cover a wide range of syntactic and semantic phenomena, including anaphora and intersentential dependencies. Ungrammatical examples are included. Vocabulary is limited.
The method is only described in very general terms: the sentences are processed by the system being evaluated, the data base query generated and the resulting query used to query the data base. The results provide data relevant to a number of different measures: accuracy of lexical analysis, accuracy of parsing, accuracy of domain-independent semantics, correctness of the data base query generated and correctness and appropriateness of the answer. As might be expected, most of these measures are intimately related to the theory incorporated into the system. Accuracy of parsing, for example, can only be measured against what the theory of parsing implemented defines as a correct parse. The close connection between measures and theories underlying the system is typical of diagnostic evaluation; the purpose of the exercise is to provide feedback for the research workers developing the system on where modification or extension is needed.
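A test-suite run of this kind can be sketched as a loop tallying per-construction correctness of the generated data base query. The entry schema (sentence, construction type, expected query) and the toy lookup ``system'' below are assumptions made for illustration, since Flickinger et al. describe the method only in general terms:

```python
from collections import defaultdict

def run_test_suite(system, suite):
    """system: callable mapping a sentence to a generated data base query.
    suite: iterable of (sentence, construction_type, expected_query) tuples
    (a hypothetical schema for illustration).
    Returns the proportion of correctly generated queries per construction."""
    tallies = defaultdict(lambda: [0, 0])  # construction -> [correct, total]
    for sentence, construction, expected in suite:
        tallies[construction][1] += 1
        if system(sentence) == expected:
            tallies[construction][0] += 1
    return {c: correct / total for c, (correct, total) in tallies.items()}

# A toy "system": table lookup standing in for a real query generator.
toy = {"list all towns": "SELECT name FROM towns",
       "which towns have parks?": "SELECT name FROM towns WHERE parks > 0"}.get

scores = run_test_suite(toy, [
    ("list all towns", "imperative", "SELECT name FROM towns"),
    ("which towns have parks?", "wh-question", "SELECT * FROM towns"),
])
# scores["imperative"] == 1.0; scores["wh-question"] == 0.0
```

Grouping results by construction type is what makes the exercise diagnostic: a low score for one construction points the developers at the component handling that phenomenon.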
Because knowledge of the internal workings of the system and of its theoretical underpinnings is required, the evaluation is a glass-box evaluation. This is in contradistinction to the black-box evaluations typical of adequacy evaluations of market products. There, the manufacturer will most frequently deny access to intermediate results and will describe the underlying technology only in the vaguest terms in order to protect what he sees as his commercial interests; as a result, the evaluator must work only with the outputs produced by particular inputs and perhaps a tertium comparationis in the form of a pre-defined expected output.
The work described here provoked a strong interest in the construction and use of test suites, reflected in several of the papers contained in (Falkedal94) and in a number of on-going research projects, such as the TSNLP project discussed in section TSNLP -- test suites for NLP.
For our last example, we turn to the domain of fact extraction, in a context where evaluation is used to stimulate and guide research. Since 1987, the Defense Advanced Research Projects Agency (DARPA) in the USA has sponsored a series of evaluations of message understanding systems. The task is to extract material for a structured information base from a variety of naturally occurring texts. Four conferences have taken place; the description here will concentrate on the third, MUC-3. Fuller accounts can be found in (Chinchor91), (Lehnert91), (MUC391) and (Sundheim91).
The type of evaluation is black-box, although analysis of the results in the light of the text analysis techniques used by particular systems can give some clues about how well particular techniques fare. The basic strategy is to define a goal by creating a test collection consisting of a set of texts and a set of relevance criteria for the texts and for the information to be extracted. A set of ``answer templates'' is then defined, and the system evaluated by comparing the templates it produces with the answer templates. Fifteen systems participated in MUC-3. The evaluation is thus a comparative evaluation of these systems' adequacy in fulfilling the particular task. Re-testing a system after modification using the same material can also produce an evaluation of that system's progress.
Once again, the only quality characteristic taken to be relevant is the system's functionality. The attribute is the system's ability to extract essential information of the specified kind and the measure is the number of template slots correctly filled in each case.
MUC-3 used a corpus of 1,600 articles, each about half a page long, drawn from a variety of text types, including newspaper articles, television and radio news, and speech and interview transcripts. The articles were divided into a training set of 1,300 texts made available to all fifteen participating sites and a test set of 300 articles. The articles covered a wide range of linguistic phenomena and included ungrammatical input.
The fifteen different systems included pattern matching systems, where there was fairly direct mapping from text to slot fillers, syntax-driven systems in which a traditional syntactic structure was produced and input to subsequent processing, and semantics-driven systems, guided primarily by semantic predictions but perhaps also using some degree of syntactic information and/or some pattern matching.
The specific task was to extract information on terrorist incidents, such as incident type, date, location, perpetrator, target, instrument, outcome, etc., from the 300 articles in the test set. Many of the articles were irrelevant to the task, and even the relevant articles did not contain only relevant information.
In preparation, all participants manually generated an agreed set of filled templates from the training set according to a set of relevance criteria and rules refined during the generation process. Performance of systems on the test set was evaluated against an independently provided set of answer templates specifying, for each article, the information that ought to have been extracted. A semi-automated scoring program was developed to calculate the various measures of performance. The two primary measures were completeness (analogous to recall in information retrieval) and accuracy (analogous to precision in information retrieval). Completeness was calculated as the ratio between the number of template slots filled correctly by the system and the total number of filled slots in the answer template. Fills corresponding exactly to the fill in the answer template scored 1.0, whilst fills judged by humans to be a good partial match scored 0.5. Accuracy was the ratio of slots correctly filled to the total number of fills generated. Two other measures, over-generation and fallout, were also used, but will not be discussed here.
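Under these definitions, both primary measures reduce to simple ratios over the human per-slot judgements. A minimal sketch, in which the judgement labels (`'correct'`, `'partial'`, `'wrong'`) are a representation chosen here purely for illustration:

```python
def muc_scores(judgements, answer_slots, system_fills):
    """judgements: one label per system fill, each 'correct' (scored 1.0),
    'partial' (0.5, a humanly judged good partial match) or 'wrong' (0.0).
    answer_slots: number of filled slots in the answer template.
    system_fills: total number of fills the system generated.
    Returns (completeness, accuracy), the MUC-3 analogues of recall
    and precision."""
    credit = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}
    points = sum(credit[j] for j in judgements)
    return points / answer_slots, points / system_fills

# three exact and two partial fills; 10 answer slots, 8 fills generated
c, a = muc_scores(["correct"] * 3 + ["partial"] * 2 + ["wrong"] * 3, 10, 8)
# completeness = 4.0 / 10 = 0.4, accuracy = 4.0 / 8 = 0.5
```

Note that the two measures share a numerator and differ only in the denominator, which is why a system can trade one against the other by generating more or fewer fills.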
The method was particularly stringent. The test could be executed only once. Systems that crashed were allowed to restart but were not allowed to re-process any message that caused a fatal error. (Lehnert91) note that this procedure resulted in scores for some sites that did not reflect true system capability. They also report that despite the stringency of the method, eight sites achieved at least 20% completeness and 50% accuracy. Two systems exhibited completeness scores over 40% with accuracy over 60%.
If we look at the results in general, all systems performed better on accuracy than on completeness. The linguistically based systems could not adjust by increasing completeness at the expense of accuracy. Many participants identified discourse analysis as a major trouble spot, arising when information derived from sentences had to be re-organised into target template instantiations. No participant claimed to have a satisfactory discourse component. The four top scoring systems differed very widely in their text analysis techniques. One high-ranking system worked with a 6,000 word dictionary, no formal grammar and no syntactic parse trees, and a close competitor operated with a 60,000 word dictionary, a syntactic grammar and syntactic parse trees for every sentence encountered. However, when the systems were ranked according to the highest combined completeness and accuracy scores, the top eight systems were all natural language processing systems rather than systems using exclusively stochastic or inductive methods.
As already mentioned, the MUC evaluations stand in a tradition created and maintained by DARPA and ARPA of careful and rigorous evaluation techniques being used for comparative evaluation of a number of systems which may all work according to quite different principles. Although such evaluation provides clear information on a system's adequacy with respect to the particular task in hand, it is difficult to estimate its usefulness in predicting the capacity of individual systems to evolve and progress or the portability of systems trained on texts drawn from one domain to a different domain. Nonetheless, it cannot be denied that the existence of test collections such as those developed for MUC-3 constitute a valuable resource for the research community.
As noted earlier, the same philosophy of applying black-box evaluation comparatively to a number of different systems working on different principles is now also being adopted with on-going research on machine translation. The evaluation methodology, however, is rather different: with the MUC evaluations it is possible to specify the task to be accomplished quite precisely, since a critical element of the test collection is a set of correct answers. This cannot be done for machine translation; it is in the nature of translation that, for any given text, potentially many translations would all be equally acceptable.
Even the very small number of evaluations described here is enough to support the conclusion that evaluations vary enormously in their purpose, in their scope and in the nature of the object being evaluated. Consequently, it is hardly surprising that evaluation techniques in their turn differ widely, as do the resources they require. It is this observation which led Flickinger et al., as noted earlier, to the conclusion that it is in principle impossible to envisage the design and construction of some general evaluation tool, into which any natural language processing system could be plugged in order to obtain data relevant to a set of informative measures. (Galliers93) formulate the same hypothesis rather nicely:
The many elements involved in evaluation, perspective and levels on the one hand and system structure and applications on the other, mean that it is quite unreasonable to look for common, or simple evaluation techniques. The Workshop ... instead emphasised the need for very careful analysis of what is involved and required for any individual evaluation, and the limited extent to which evaluation techniques can be effortlessly, routinely or even legitimately transferred from one case to another. It is essential to determine comparability before such transfers can be made, and as in many situations true comparability will be very restricted, what can be common in the field is rather a set of methodologies and an attitude of mind. Thus evaluation in NLP is best modelled by analogy with training the cook and supplying her with a good batterie de cuisine.
We might also ask whether it is not possible to share some of the resources used in evaluation. It is intuitively clear that test materials like test collections or test suites are expensive to produce and maintain: there would be obvious interest in producing materials which could be shared by the community as a whole and re-used in different evaluations. Furthermore, use of the same test material might be expected to produce evaluation results which could more easily be compared, thus leading to another kind of shared resource.