As said above, we feel that future EAGLES work should focus, amongst other things, on the construction of a PTB as a concrete instantiation of the notion of evaluation in language engineering. It should be noted that, although the conceptual framework developed here arose in the context of language engineering, it applies to evaluation much more generally.
It should be clear that this is a partial approach to evaluation. It needs to be supplemented with work on the relation between users' activities and goals (e.g. in working environments) and users' needs, and it should be integrated with existing views on evaluation procedures as given in ISO 9126.
A (parameterisable) test bed can only be partially automated: many of the tests needed to evaluate objects are still not well understood, and others cannot be automated in the current state of technology. We do not yet have computational procedures that can tell us which of two candidate icons is more easily guessed as meaning `cut highlighted text'. Nevertheless, we believe that the idea of an automated procedure is very important for a systematic approach to evaluation.
A Parameterisable Test Bed (PTB) is a program that:
Examples of relevant objects are: spelling, grammar, style checkers; information retrieval; and translation systems. Note that a PTB may not be able to perform the whole procedure without human assistance; see below, under `library'.
Objects are described in terms of attributes and values. The set of objects is structured (into subtypes and components), as is the set of attributes (e.g. functionality-related, usability-related, etc.).
Under the current view, a user is modelled essentially as a list of desiderata for an object. That is, a user is described in terms of a choice of attributes, weightings amongst those attributes, and specifications constraining attribute values. The section on user requirements outlines some preliminary ideas on how to relate the choice of pertinent attributes systematically to the interests of users of various kinds, in interaction with their goals.
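The object and user descriptions above can be sketched as simple data structures. This is only an illustrative sketch; the class names, attribute names, and example figures are assumptions for exposition, not part of the TEMAA implementation.

```python
from dataclasses import dataclass, field

@dataclass
class EvalObject:
    """An object under evaluation, e.g. a spelling checker,
    described in terms of attributes and their measured values."""
    name: str
    attributes: dict = field(default_factory=dict)  # attribute -> value

@dataclass
class UserProfile:
    """A user modelled as weighted attributes plus constraints on
    attribute values (a list of desiderata for an object)."""
    weights: dict       # attribute -> relative importance
    constraints: dict   # attribute -> predicate on the value

# Hypothetical example: a spelling checker and an office-user profile.
checker = EvalObject("SpellRight", {"recall": 0.93, "suggestion_quality": 0.7})
office_user = UserProfile(
    weights={"recall": 0.6, "suggestion_quality": 0.4},
    constraints={"recall": lambda v: v >= 0.9},  # hard requirement
)
```

A user type is thus fully determined by which attributes appear, how they are weighted against each other, and which values are acceptable at all.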
The library contains for each attribute a test to compute its value. Importantly, attributes differ as to the types of test they will allow. Some attributes can be tested automatically (e.g. in the TEMAA project (Thompson94) an automatic procedure to test functionality of spelling checkers has been developed); but for other attributes, like those related to `user-friendliness', automated test procedures are much less feasible right now. In the latter type of case, all the PTB can do is print a report on the appropriate testing procedure. Both for automated tests and for those that rely on human co-operation, a PTB should be able to support the creation, integration and maintenance of a library of test materials, for example test suites, used by the different tests.
The PTB will, for each relevant test, either perform it or ask the human PTB user to do this (and feed the result back into the PTB). The result is a description of each object tested in terms of attribute-values.
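The test library and the dispatch loop just described can be sketched as follows: automated tests are executed directly, while tests that cannot be automated cause the PTB to print the appropriate testing procedure and take the result back from the human PTB user. All names here are hypothetical, assumed for illustration only.

```python
def spelling_recall_test(obj):
    """Placeholder for an automated functionality test
    (cf. the TEMAA spelling-checker procedure)."""
    return obj["known_recall"]

# The library pairs each attribute with a test: either an automated
# procedure, or a description of a manual testing procedure.
TEST_LIBRARY = {
    "recall": ("automated", spelling_recall_test),
    "icon_guessability": ("manual", "Show candidate icons to subjects and record guesses."),
}

def run_tests(obj, human_answers):
    """Return a description of the object as attribute -> value."""
    results = {}
    for attr, (kind, test) in TEST_LIBRARY.items():
        if kind == "automated":
            results[attr] = test(obj)
        else:
            # Print the procedure; the human PTB user feeds the result back.
            print(f"Manual test for {attr}: {test}")
            results[attr] = human_answers[attr]
    return results
```

The output of this loop is exactly the attribute-value description of each tested object referred to above.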
Comparison between the testing results and the user description (in terms of weighted attributes and specifications, see above, `users') will lead to a partial ranking of objects for a given user type.
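The comparison step can be sketched as a weighted scoring over the attribute-value descriptions, with the user's specifications acting as hard constraints: objects failing a constraint are excluded, and the rest are ranked by weighted score. This is one plausible reading of the text, not a prescribed algorithm.

```python
def score(attr_values, weights, constraints):
    """Weighted score for one object, or None if it violates
    any of the user's hard constraints."""
    if any(not ok(attr_values[a]) for a, ok in constraints.items()):
        return None
    return sum(w * attr_values[a] for a, w in weights.items())

def rank(objects, weights, constraints):
    """Rank objects (name -> attribute values) for one user type."""
    scored = [(name, score(vals, weights, constraints))
              for name, vals in objects.items()]
    admissible = [(n, s) for n, s in scored if s is not None]
    return sorted(admissible, key=lambda ns: ns[1], reverse=True)
```

Note that the ranking is partial: objects excluded by a constraint are not ordered at all, and different user types (different weights and constraints) will generally rank the same objects differently.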
In the TEMAA project (LRE 62-070), a very limited instance of a PTB has been implemented, which tests spelling checker functionality.
Creating a full PTB for language engineering is an open-ended task, since new object types and new kinds of user will keep appearing. Even restricted to a snapshot of the current state of the art, it is a huge job. However, planning in this direction is useful for evaluation in language engineering in two ways:
The general orientation should be towards a set of programs to be used, not by end users or customers, but rather by specialised `evaluation agents': language-related organisations such as ELSNET or ELRA, and periodicals such as Linguistic Industry Monitor or various computer journals. One might even consider aiming for a specialised evaluation agency for language engineering.
Here are some potential tasks for PTB creation in the near future:
As said above, evaluation for language engineering is not fundamentally different from evaluation in general; our thinking has been influenced heavily by evaluation in other areas (cf. the `consumer report' paradigm). Although we would like to see a PTB for language engineering built, we also hope that others working on evaluation and evaluation design will find the EAGLES work that led to this proposal fruitful and a source of ideas for their own work, quite independently of any concrete instantiation of those ideas.