The report contains a main body, which sets out the general framework proposed by the group and a certain amount of background material, accompanied by substantial appendices which report on the work in particular areas.
The main body starts by recapitulating the starting point, an existing Standard, ISO 9126, and explaining the extensions to that Standard made by the group. Most of these extensions stem from an attempt to think of evaluation not in terms of the desires of one specific user or customer, whose needs and constraints can be built into the evaluation, but in terms of a general evaluation methodology which, by catering potentially for the desires of a wide variety of classes of user, can be tailored in any given case to reflect specific desires.
The following section describes a first attempt to make such an evaluation concrete in the form of a parametrized test-bed. The basic idea here is that it should be possible to construct: descriptions of products (the objects of evaluation); descriptions of classes of users (the customers of evaluation); and descriptions of attributes of systems potentially of interest to classes of users, coupled with metrics which, when applied to a product, would provide a value for that product for each attribute. Then, when an evaluation of a particular product or class of products is to be performed on behalf of a particular class of users, the desires of that class of users can be used to pick out the attributes and metrics pertinent to the specific case. Although some of the metrics lead to automated tests, others are by their nature not completely automatable. In these cases, the test-bed produces a set of instructions for the human on how to conduct and report on the test.
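The selection mechanism described above can be sketched in code. The following is a minimal illustration only, not part of the test-bed itself; all names and data (`ATTRIBUTES`, `USER_CLASS`, `PRODUCT`, the example metric) are invented for the example.

```python
# Sketch of the parametrized test-bed idea: attributes coupled with
# metrics, where a metric is either an automated test (a callable
# applied to a product description) or instructions for a human tester.
# All names and figures here are hypothetical.
ATTRIBUTES = {
    "spelling_coverage": {
        "automated": True,
        "metric": lambda product: product["errors_flagged"] / product["errors_present"],
    },
    "ease_of_learning": {
        "automated": False,
        "metric": "Instructions: time a new user performing tasks T1-T3 and report the results.",
    },
}

# A class of users is characterised partly by the attributes it cares about.
USER_CLASS = {
    "name": "professional translators",
    "relevant_attributes": ["spelling_coverage", "ease_of_learning"],
}

# A product description: the object of evaluation.
PRODUCT = {"name": "checker X", "errors_present": 200, "errors_flagged": 150}

def evaluate(product, user_class):
    """Pick out the attributes pertinent to this class of users and,
    for each, either run the automated metric or emit the manual
    test instructions."""
    report = {}
    for attr in user_class["relevant_attributes"]:
        entry = ATTRIBUTES[attr]
        if entry["automated"]:
            report[attr] = entry["metric"](product)
        else:
            report[attr] = entry["metric"]  # instructions for the human tester
    return report

print(evaluate(PRODUCT, USER_CLASS))
```

The point of the sketch is only the shape of the machinery: the same attribute inventory serves every evaluation, and the user class picks out which entries apply.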
In order to be able to produce such a parametrized test-bed, the descriptions of products, of users and of tests need to be described in formal terms. This section therefore also contains some preliminary work on a formalisation in terms of attribute-value structures of the sort familiar from work in computational linguistics.
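To make the notion of attribute-value structures concrete, here is a hedged sketch representing them as nested dictionaries, together with the unification operation familiar from computational linguistics. The feature names and values are invented for illustration and do not come from the report's formalisation.

```python
# Attribute-value structures as nested dicts, with unification:
# atomic values must match exactly, sub-structures are merged
# recursively, and a conflict makes the whole unification fail.

FAIL = object()  # sentinel signalling unification failure

def unify(a, b):
    """Unify two attribute-value structures."""
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            if key in result:
                sub = unify(result[key], value)
                if sub is FAIL:
                    return FAIL
                result[key] = sub
            else:
                result[key] = value
        return result
    return FAIL  # conflicting atomic values

# A user-class requirement and a product description, both as
# attribute-value structures (hypothetical features):
requirement = {"languages": {"source": "English"}, "platform": "PC"}
product = {"languages": {"source": "English", "target": "French"},
           "platform": "PC"}

# Unification succeeds, so the product is compatible with the requirement.
print(unify(requirement, product) is not FAIL)
```

Representing products and user classes in the same formalism is what allows compatibility to be computed rather than judged case by case.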
The characteristics of classes of users need to be discovered and formalized. The following section describes and discusses some preliminary work on techniques for doing this, drawing inspiration from recent work on requirements analysis in the software engineering field.
The final parts of the framework model chapter (2.5 and 2.6) review experience in software engineering and the methods used there for carrying out an evaluation. The emphasis is on experiment design, and the advantages and disadvantages of a variety of measures and methods are discussed.
The second part of the main body of the report contains background material. The first section reports on the use of consumer reports in other areas. As the reader will discover, the consumer report paradigm has served as an important part of the framework model. Next, brief summaries are given of work within some of the LRE (Linguistic Research and Engineering) projects which have collaborated with or made use of EAGLES work. An account of relevant previous work on evaluation is also given.
It is intended that the main body of the report can be read independently of the appendices, although it should be remembered that the attempts to put the framework model of the main body into practical application are to be found in the appendices, and that the reader might therefore find it useful from time to time to look for detailed examples there.
The first appendix gives an overview of relevant ISO terms and guidelines.
The second appendix provides a selection of methods for the measurement of software, focusing on the special problems that software evaluation poses in the NLP area.
Then follows an appendix describing investigations into issues and methods for user profiling and requirements analysis for language engineering evaluation. The appendix is a fuller version of the section on requirements in the main body of the report.
The appendices then cover first the application of the framework to the evaluation of writers' aids and translators' aids. These two appendices are substantial attempts to work out detailed applications of the framework. They are, however, far from complete: although much work has been done on evaluating the functionalities of the products considered, much less has been done on characteristics such as usability, where the starting points were less solid.
The appendix on writers' aids is supplemented by an appendix giving a detailed account of testing carried out on grammar checkers. The appendix on translators' aids will be supplemented in the near future by detailed accounts of testing carried out on translation memories.
The next appendix reports on the rather more limited work we have been able to do on the evaluation of knowledge management systems. As we have already mentioned, this work takes the form of working out a set of requirements specifications for such systems, from which evaluation criteria may subsequently be deduced.
The final appendix is also rather different in nature. It consists of a detailed descriptive study of a very large translation service, that of the European Commission. It was decided to carry out such a study after earlier work on user profiling, reported on in the interim report (EAGLES94), led to the realisation that a translation service as large and as complex as that of the Commission did not fit easily into any neat pigeonhole designed to hold the profile of a typical user.
We close this guide by reminding the reader once again that this is a report on work in progress: it is not yet fully fashioned, frequently incomplete and sometimes even fragmentary. We would be grateful for reactions and indeed for help, as work on standardisation is only validated and advanced through feedback and collaboration. Comments may be communicated in various forms: