In the interest of standardisation, this section describes initial steps towards formalisation of evaluation.
To evaluate is to determine what something is worth to somebody. We describe evaluation as a function relating objects and users to something we will call utility. Utilities can sometimes be expressed in financial terms, but that does not concern us here. The important thing is that utilities represent a consistent preference relation among the items to which they are assigned (cf. utility theory).
We then look at the nature of the descriptions of objects, found in e.g. consumer reports. Some useful primitives are introduced. The formal machinery is taken from the world of feature structures, well-known to computational linguists.
We then define some notions relevant to evaluation in terms of these primitives.
To evaluate is to determine what something is worth to somebody. We can summarise this in the following function:

evaluation: O × U → V

where
V: represents the idea of utility that drives any evaluation: the basic idea is that evaluation expresses what some object is worth to somebody; V expresses `worth'. Utility may sometimes be related to money, but this cannot in general be assumed. We will tentatively define V as linearly ordered. This means that we can at least define relative utility by mapping object-user pairs to V.
O: represents objects of evaluation. Anything can in principle be evaluated, including dishwashing machines, project proposals, progress in ongoing work and evaluation procedures. In this report, we restrict O to computer programs containing some linguistic knowledge and will take all examples from this domain. The object of evaluation can be structured, i.e. it can sometimes be seen as a structure of components or functionalities that can serve as objects of evaluation themselves. For example, a translator's workbench can contain components such as a special editor, an on-line terminology database and a translation memory; the latter can be further subdivided into update and application functions. An evaluation-related question about the package as a whole may, for example, examine its integratedness or the requirements it imposes on the hardware platform. Other questions pertain to components; for example, the update/maintenance properties of a term bank may be very different from those of a translation memory.
U: represents users, i.e. people or organisations (potentially) interested in members of O. The notion of user is philosophically complicated. Perhaps the best view is to see it as a certain desire. A user is somebody who wants to have something or get something done. Users come in kinds. For example, the owner of a translation bureau may have a different perspective from a translator they employ. The latter may find aspects of `user-friendliness' of some computer tool more important than the former. As a more specific example, the presence of a component such as shared terminology validation procedures in a translators' package will be more relevant for translation organisations than for freelances who work on their own.
Carrying this line of thinking further, all the factors which are often called environmental or situational variables help to define the user's desires, and are therefore part of U. If we are considering a system which can be broken down into distinguishable components, some of which may be subject to individual evaluation, we can even go so far as to say that the constraints one component of the system imposes on another (for example in the form of required output) form part of the user's desires: the user wants a task to be performed, and therefore wants all the sub-tasks of that task to be performed. Thus U includes not only all the constraints and desires resulting from the user's environment, but also, where relevant, the constraints imposed by sub-components of an overall system which might fulfill the user's needs.
We should also keep in mind that all relevant distinctions in the contexts of use can be seen as distinctions amongst types of user. In future work, U may be broken down to reflect the granularity of these distinctions. It should then become possible to see U itself as a function of conditions such as a particular kind of writer population, some specific bias in spelling errors, the fact that a personal computer has to be used as the hardware platform, etc.
The basic function given above can be curried in two ways, obtaining two perspectives on evaluation:
evaluation_O: O → (U → V) describes the `object-based' picture: given some object, evaluation tells us who likes it;

evaluation_U: U → (O → V) gives the `user-based' picture: given some user, evaluation tells us what they like.
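A minimal sketch of evaluation as a function from object-user pairs to utilities, together with its two curried perspectives; all object names, user names and utility figures below are invented for illustration:

```python
# Evaluation as a function from (object, user) pairs to a utility value in V,
# with V linearly ordered (here represented simply by integers).
# All objects, users and utilities are invented for illustration.
from typing import Callable

UTILITY = {
    ("translation_memory", "freelance_translator"): 3,
    ("translation_memory", "translation_bureau"): 5,
    ("spelling_checker", "freelance_translator"): 4,
    ("spelling_checker", "translation_bureau"): 2,
}

def evaluation(obj: str, user: str) -> int:
    return UTILITY[(obj, user)]

# Object-based currying: given an object, a function from users to utility.
def object_view(obj: str) -> Callable[[str], int]:
    return lambda user: evaluation(obj, user)

# User-based currying: given a user, a function from objects to utility.
def user_view(user: str) -> Callable[[str], int]:
    return lambda obj: evaluation(obj, user)
```

Because V is linearly ordered, either curried view induces a preference ranking: for a fixed user, objects can be sorted by the value `user_view` assigns to them.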
A central role in practical evaluation is played by descriptions of objects (and components thereof) in ways that help to determine their utility value for various kinds of users. We think it is attractive to describe objects of evaluation in terms of typed feature structures, i.e. pairings of a type and an attribute-value structure.
An object type corresponds to a class of objects in O defined by the fact that all of its members execute some specific function. Some possible object types in our domain are: editor, term bank, spelling checker. A program will usually be indicated by a concrete or agentive noun, e.g. parser. The function it performs is usually indicated by a nomen actionis, e.g. alignment. We will use types indiscriminately to denote both the programs and the functions they fulfill.
An attribute refers to a property which, for a given member of O, can be assigned one of a range of values. For example, some C compiler can be described by attributes like speed, version of the language, various debugging options, etc. Some car can be described by attributes like speed, fuel consumption, various attributes related to safety, etc.
Examples of attributes for a translation memory are its maximum size and its speed of retrieval.
Attributes are typed according to their possible values. The range of values (scale) can for example be Boolean (yes/no), nominal or classificatory (a set of unordered values), comparative (a set of ordered values), ordinal (a range of values whose differences can be compared) or metric (real valued with fixed origin and unit). Some attributes can have other feature structures as values.
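The scale typing described above can be sketched as follows; the scale names follow the text, while the example attributes and value sets are invented for illustration:

```python
# Sketch of attribute scale types. The scale inventory follows the text;
# the example attributes and their value sets are invented.
from enum import Enum

class Scale(Enum):
    BOOLEAN = "boolean"          # yes/no
    NOMINAL = "nominal"          # a set of unordered values
    COMPARATIVE = "comparative"  # a set of ordered values
    ORDINAL = "ordinal"          # ordered, differences comparable
    METRIC = "metric"            # real valued, fixed origin and unit

class Attribute:
    def __init__(self, name, scale, allowed=None):
        self.name = name
        self.scale = scale
        self.allowed = allowed   # value set, e.g. for nominal scales

    def accepts(self, value):
        """Check that a value is admissible on this attribute's scale."""
        if self.scale is Scale.BOOLEAN:
            return isinstance(value, bool)
        if self.scale is Scale.METRIC:
            return isinstance(value, (int, float)) and not isinstance(value, bool)
        return self.allowed is None or value in self.allowed

# Invented examples, in the spirit of the C-compiler illustration above:
speed = Attribute("speed", Scale.METRIC)
language_version = Attribute("language_version", Scale.NOMINAL,
                             {"C89", "C99", "C11"})
```

An attribute whose value is itself a feature structure would simply hold a nested structure rather than a scalar; the scalar scales above cover the base cases.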
The attributes should be chosen with a view to their relevance to utility. Some general principles guiding the selection of evaluative attributes are summarised in section ISO 9126. Some further principles for attributes are described in Appendices Evaluation of Writers' Aids and Evaluation of Translators' Aids.
Furthermore, the attributes should be chosen in such a way that it is possible to establish, for a given object, what values it takes. That is, the attributes should be measurable. As we will see later, attributes can be typed according to the method of measurement needed to establish the value for a given object (section Methods for system measurement, on test types).
The matrices one typically sees in consumer-oriented evaluations of products are attribute-value matrices in this sense. In Appendix Evaluation of Translators' Aids, section Featurization, worked-out descriptions in this form are given.
Explicit reference to user types is not commonly made in consumer reports. Sometimes, in the accompanying text, specific user profiles are mentioned (e.g. reference to people who wear spectacles in reports on binoculars); but most often, these reports assume that users know what they want. References to V are made unsystematically, e.g. in the form of a remark about some price being too high. The emphasis in consumer reports is essentially on describing the objects of evaluation. The reason for this is, partly, that differentiation according to user profiles is hard and the mapping to V often impossible to specify; and partly, that these consumer reports are typically about well-known classes of objects like dishwashers, so that the readers can be assumed to know well what they like. This latter situation does not apply to all linguistic software.
Members of O perform functions, functions are characterised by types and attributes, and attributes take values. Splitting up a software item in terms of a typed feature structure will be called featurisation in this section. Each type of feature can allow recursive refinement into subfeatures.
A feature is an attribute-value pair. Examples of features of a translation memory are the presence or absence of certain components or functions; or the values of metric attributes like a certain maximum size or speed of retrieval. A higher level feature can be complex valued (i.e. have a feature structure as value).
A feature checklist is like a featurisation, but the values are left open. Elaborate examples of feature checklists are given in section Featurization.
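The notions of featurisation and feature checklist can be sketched as follows, representing a typed feature structure as a (type, attribute-value) pair in which values may themselves be feature structures; all attribute names and values are invented for illustration:

```python
# A featurisation as a typed feature structure: a (type, attribute-value)
# pairing in which a value may itself be a feature structure.
# All attribute names and values are invented for illustration.
translation_memory = ("translation_memory", {
    "max_size_units": 500_000,       # metric attribute
    "retrieval_speed_ms": 40,        # metric attribute
    "update": ("update_function", {  # complex-valued feature
        "batch_mode": True,          # Boolean attribute
    }),
})

def checklist(featurisation):
    """Turn a featurisation into a feature checklist: the same typed
    structure, but with all values left open (None)."""
    ftype, avm = featurisation
    open_avm = {}
    for attr, value in avm.items():
        if isinstance(value, tuple):   # embedded feature structure: recurse
            open_avm[attr] = checklist(value)
        else:
            open_avm[attr] = None
    return (ftype, open_avm)
```

In this representation a quality characteristic could be obtained by projection, i.e. by keeping only a chosen subset of the attributes at each level.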
A quality characteristic is a projection of a feature checklist (picking out a certain subset of the feature structure). Example: the basic characteristics listed in Appendix ISO Terms and Guidelines, section Quality Characteristics.
A specification is a constraint on featurisations. Example: a desired dictionary can be specified as containing at least 200,000 lemmas.
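A specification, as a constraint on featurisations, can be sketched as a predicate that a featurisation either satisfies or not. The 200,000-lemma dictionary example follows the text; the attribute name `lemma_count` is invented:

```python
# A specification as a constraint on featurisations: a predicate over
# typed feature structures. The dictionary example follows the text;
# the attribute name "lemma_count" is invented for illustration.
def spec_min_lemmas(featurisation, minimum=200_000):
    """A desired dictionary contains at least `minimum` lemmas."""
    ftype, avm = featurisation
    return ftype == "dictionary" and avm.get("lemma_count", 0) >= minimum

small_dict = ("dictionary", {"lemma_count": 150_000})
large_dict = ("dictionary", {"lemma_count": 250_000})
```

A norm, whether used prescriptively or descriptively, would have exactly the same shape: a predicate over featurisations, differing from a specification only in its pragmatics.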
Criterion, as the word is used in ordinary language, can be defined as synonymous with specification (though the pragmatics of the two words are different). In this section, criterion is usually used as a synonym for attribute.
A norm is, again, a constraint on featurisations. Norms can be used in a prescriptive way (which makes the word very similar in pragmatic meaning to specification) or in a descriptive way (describing the state of the art).
A user profile is a function from the domain of featurisations to V (i.e. it defines some class of users in terms of what they like or require). In Appendix Evaluation of Translators' Aids, section Translators' aids: user profiles, classes of users are exemplified.
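A user profile, as a function from featurisations to V, can be sketched as follows; the attributes and weights are invented for illustration and do not come from the appendix:

```python
# A user profile as a function from featurisations to V: it scores a
# featurisation according to what one class of users cares about.
# All attributes and weights below are invented for illustration.
def bureau_profile(featurisation):
    """A hypothetical translation-bureau profile: values shared
    terminology validation highly, fast retrieval less so."""
    _, avm = featurisation
    score = 0
    if avm.get("shared_term_validation"):
        score += 10
    if avm.get("retrieval_speed_ms", 1_000) < 100:
        score += 3
    return score

package_a = ("translators_package",
             {"shared_term_validation": True, "retrieval_speed_ms": 40})
package_b = ("translators_package",
             {"shared_term_validation": False, "retrieval_speed_ms": 40})
```

A freelance profile would weight the same attributes differently, which is exactly the point: one featurisation, several profiles, several utilities.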
A test is an object that produces values for a given attribute, given members of O. Tests can be typed by attributes tested, inputs, outputs, tools, procedures, personnel, duration, .... For a discussion of test types, see section Methods for system measurement.
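A test, as an object that produces attribute values for members of O, can be sketched as follows; the test, its attribute name and the toy measurement procedure are invented for illustration:

```python
# A test as an object that, applied to a member of O, produces a value
# for a given attribute. The test, attribute name and measurement
# procedure below are invented for illustration.
import time

class Test:
    def __init__(self, attribute, procedure):
        self.attribute = attribute
        self.procedure = procedure   # callable: object -> measured value

    def run(self, obj):
        """Apply the measurement procedure and return (attribute, value)."""
        return (self.attribute, self.procedure(obj))

# A toy retrieval-speed test for a translation memory modelled as a dict:
def measure_retrieval(tm):
    start = time.perf_counter()
    _ = tm.get("segment_42")         # the lookup being timed
    return time.perf_counter() - start

speed_test = Test("retrieval_speed_s", measure_retrieval)
```

The further typing of tests (by inputs, outputs, tools, procedures, personnel, duration) would amount to additional attributes on the Test object itself.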
Not all facets of the evaluation function can be standardised. Especially the relation between measurement and utility is often difficult to define explicitly. This will be called the problem of validity: are the attributes really related to utility? Attempts to verify the validity of featurisations in practice include estimating the effects of tools on productivity in industry, and market surveys asking whether some class of potential customers would be willing to spend a certain sum on a hypothetical product built according to some specification.
Efforts towards standardisation of evaluation in the domain of natural language processing should be directed towards featurisation in the first place. The featurisations themselves cannot be standardised, but one can aim at standard feature checklists (i.e. the attributes only) per object type.
Other priority areas for investigation are user profiles and test types.