The ISO 9126 Standard sets out a general framework for designing an evaluation. The EAGLES group, together with the TEMAA LRE project, seeks to apply this general framework to the evaluation of products in the two areas of writers' aids and translators' aids. (In earlier reports, we talked of ``adequacy evaluation of market or near-market products'', meaning by this that we were interested in finding ways by which someone could discover whether a particular product was adequate for their purposes.) This has led to augmentation of the ISO 9126 Standard.
The most important of these augmentations concerns the formulation of stated or implied needs, which, it will be remembered, are the primary input to the quality requirement definition. The EAGLES work aims at producing an evaluation package from which different elements can be taken and combined in different ways to reflect the needs of any particular user. There are, here, no stated needs, in the ISO 9126 sense of contractually binding specifications. What is in question are the implied needs of classes of users, which must be worked out through user profiling and requirements analysis techniques. So far, it has not been possible to do very much work on defining appropriate techniques for characterising users and their needs, but their importance has become increasingly obvious and is emphasised in the section on requirements analysis later in this chapter.
Thinking in this way of defining the needs of classes of users and allowing the specific user to identify which of those needs fit his specific case leads to what in EAGLES is called the consumer report paradigm. Consumer associations often publish reports on classes of products, such as dishwashers or motor cars. Individual products within each class are evaluated on a number of different dimensions and the results published in the form of a table, which gives a score for each product on each dimension. Thus, for dishwashers we might find something like:
Product name   Capacity      Programmes   Water consumption   Cleanliness   Price
XXX            12 services   6            25 litres           good          1,259
YYY             6 services   4            10 litres           average         350
ZZZ             9 services   4            15 litres           poor            965
Such tables typically do not try to make the user's choice for him: they aim to pick out characteristics of the products in question which are believed to be relevant to a user or, more plausibly, a class of users, and then to present the raw data which will allow the individual user to make an informed judgement about which product is most likely to suit his needs.
In EAGLES, we have used this idea as a way to structure our thinking about how to design an evaluation of language engineering products. We use the ISO 9126 quality characteristics as a starting point to identify attributes of products which are potentially of relevance to a class of users. We then define measures and methods whereby values for those attributes can be determined. However, we make no attempt to say in absolute terms what the relative importance of individual attributes is, or what the critical values of those attributes might be. That is determined as a function of the needs of the individual user who is a customer of the evaluation. To restate this in the terminology of ISO 9126: the quality requirements definition is based on the union of the implied needs of classes of users, and appropriate metrics are selected and measurements carried out, but the individual user is left to construct his own rating level definition and assessment criteria definition.
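This division of labour can be sketched in code. In the following hypothetical Python sketch, the figures repeat the illustrative dishwasher table, and the requirement functions are invented: the evaluation supplies only raw measurements, while the individual user supplies his own rating levels and assessment criteria.

```python
# Hypothetical sketch of the consumer report paradigm: the evaluation
# supplies raw measurements; the user supplies the assessment criteria.
# Product names and figures repeat the illustrative dishwasher table.

products = {
    "XXX": {"capacity": 12, "programmes": 6, "water_litres": 25, "price": 1259},
    "YYY": {"capacity": 6,  "programmes": 4, "water_litres": 10, "price": 350},
    "ZZZ": {"capacity": 9,  "programmes": 4, "water_litres": 15, "price": 965},
}

def acceptable(measures, requirements):
    """True if every one of the user's requirements is met by the measured values."""
    return all(check(measures[attr]) for attr, check in requirements.items())

# One user's own "rating level definition": minimum acceptable values
# for the attributes this user cares about (invented for illustration).
my_needs = {
    "capacity": lambda v: v >= 9,
    "price": lambda v: v <= 1000,
}

suitable = [name for name, m in products.items() if acceptable(m, my_needs)]
print(suitable)  # ['ZZZ']: only ZZZ meets both requirements
```

A different user would pass different requirement functions over the same raw measurements, which is exactly the point of leaving the rating level definition to the customer of the evaluation.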
The next section includes a description of some preliminary attempts to provide a concrete instantiation of the EAGLES framework in the form of a parameterisable test bed: a software implementation which contains formal descriptions of systems or products and of characteristics of users, together with specifications of metrics and measurement methods. Parameters allow the needs of a specific user to be reflected in the choice of metrics to be applied. It should be noted though that not all metrics are automatable. In many cases, the test bed produces instructions for how a human should proceed in order to obtain a measurement for some attribute singled out as pertinent. The parameterisable test bed is a direct result of the TEMAA project.
The parameterisable test bed is of necessity based on a formal definition of evaluation and on formal descriptions of user characteristics and of system characteristics. In line with much current work in computational linguistics, EAGLES/TEMAA thinks in terms of features, made up of attribute/value pairs. The definition of features may come either from a consideration of the implied needs of users or from a consideration of the characteristics of systems which already exist. Attribute/value pairs are intimately related to the metrics used to determine the values. We have already noted that whilst ISO 9126 regards metrics as ideally yielding quantifiable measures, it recognises that this is not always possible. As well as metrics based on quantifiable measures (called tests in EAGLES/TEMAA), the framework also recognises facts, attributes whose value is simply given, such as the language dealt with by a spelling checker, together with binary and scalar attributes, some of which may explicitly involve subjective human judgement. Thus attributes are typed by the kind of value they may accept. The next section goes into more detail and gives a more formal account of the machinery we have used.
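As an illustration only (this is not the TEMAA implementation), attribute/value pairs typed in the way just described might be represented as follows; all attribute names and measured values are invented:

```python
# Illustrative sketch of attributes typed by the kind of value they
# accept: facts, binary attributes, scalar attributes, and tests.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Attribute:
    name: str
    kind: str                          # "fact", "binary", "scalar", or "test"
    validate: Callable[[Any], bool]    # does a value belong to this type?

attributes = [
    # A fact: the value is simply given, e.g. the language of a spelling checker.
    Attribute("language", "fact", lambda v: isinstance(v, str)),
    # A binary attribute: yes or no.
    Attribute("handles_compounds", "binary", lambda v: v in (True, False)),
    # A scalar attribute: here a 1-5 scale, possibly a human judgement.
    Attribute("ease_of_use", "scalar", lambda v: v in range(1, 6)),
    # A test: a quantifiable measure, here the error recall of a checker.
    Attribute("error_recall", "test", lambda v: 0.0 <= v <= 1.0),
]

# Invented measurements for one hypothetical product.
measured = {"language": "Danish", "handles_compounds": True,
            "ease_of_use": 4, "error_recall": 0.87}

for attr in attributes:
    assert attr.validate(measured[attr.name]), attr.name
print("all attribute values well-typed")
```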
The ISO 9126 Standard quite deliberately leaves aside any discussion of how metrics are to be created or validated. Since EAGLES/TEMAA is involved in practical application of the general framework, such questions cannot be neglected. In particular, both measures and the methods used to obtain a measurement must be valid and reliable; that is, a metric should measure what it is supposed to measure, and should do so consistently.
The notions of validity and reliability as used within EAGLES draw on work in the social sciences and in psychology. Although there are several conceptions of validity to be found in the literature, they all essentially fall under one of two broad categories: internal (or content) validity and external (or criterion-based) validity. Internal validity is achieved by making sure that each metric adequately measures an appropriate attribute of the object to be evaluated; it is assessed by the judgement of experts. External validity is determined by calculating the coefficient of correlation between the results obtained from applying the metric and some external criterion.
A couple of examples will help to make this more concrete. Reading tests are often administered to primary school children to determine whether they can read as well as an average child of the same age. The child is required to read aloud a specially constructed text, which uses vocabulary that a child of that age is considered able to deal with. This test relies on internal validity: whether the vocabulary is well chosen is judged by experts on children's reading skills.
Another test frequently administered to school children is an IQ test. The usefulness of such tests has often been the subject of contention. A frequent argument is based on the notion of external validity: the results of the tests are shown to correlate well (or badly) with later success in academic examinations, for example, or with higher income levels in middle age.
A metric is reliable inasmuch as it consistently provides the same results when applied to the same phenomena. Reliability can be determined by calculating the coefficient of correlation between the results obtained from two separate applications of the metric.
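This test-retest notion of reliability can be sketched numerically. In the following Python sketch the scores are invented: a metric is applied twice to the same five products, and the Pearson correlation coefficient between the two sets of results serves as the reliability estimate (the same calculation, applied between a metric's results and an external criterion, would estimate external validity).

```python
# Sketch of estimating metric reliability as the Pearson correlation
# between two applications of the same metric to the same products.
# All scores are invented for illustration.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two applications of the same metric to the same five products.
run_1 = [0.82, 0.65, 0.91, 0.43, 0.77]
run_2 = [0.80, 0.68, 0.89, 0.45, 0.75]

r = pearson(run_1, run_2)
print(round(r, 3))  # close to 1.0: the metric is highly reliable
```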
Considerations of validity and reliability are not always as clear-cut as the above discussion makes them seem, particularly when the evaluation is concerned with products treating a phenomenon as complex as language, and where human intervention is sometimes needed to obtain a measurement. More detailed discussion of particular metrics will raise further questions. However, the goal to be aimed at is clear.