However, there remains the problem of how to transform requirements at the level of the problem domain into a form in which it is possible to test them against all the relevant systems under test (i.e., to transform them into specifications, in Jackson's terms, or reportable attributes, in our terms). This clearly involves not just top-down development of requirements, but is affected by the nature of the systems under test. In terms of the V-diagram, the downward direction of the process of software development is equivalent to left-to-right progress through the sets in Figure C.2.2, from problem through specification to design in terms of implementable primitives. Instead of aiming at a specification expressed in terms of implementable primitives, the evaluation process aims at identifying a set of reportable attributes expressed in terms of measurable primitives, but otherwise the processes are similar in that a bottom-up aspect is imposed by the nature of the available primitives. The validity of the specification or reportable attributes depends on the validity of the problem domain requirements and the validity of the process by which the specification or reportable attributes are derived from them.
Different systems under test may have different ways of scoping the problem they deal with, and hence be difficult to evaluate against one another. Not only that, but where a set of products do appear to converge on a definition of functionality, there is no guarantee that this definition matches user and task-based requirements. These have to be established independently by the evaluator. (Particularly for mass-market software applications, there are commercial pressures towards convergence on the latest `check-list' of functionalities which are largely independent of utility. It must be part of the evaluator's job not merely to list these features and report on their presence or absence in given products, but to provide judgements about their relation to utility.)
Then again, in the interests of software reuse, a component-based approach is popular in both general software engineering and LE. Various types of progress evaluation rely on the ability to devise a modular breakdown of the overall functionality such that individual components can be specified, evaluated and chosen independently (or designed, implemented and evaluated independently). However, it can be difficult to compare components against one another when they do not fit into a system in the same way, requiring different setups in order to work at all. As discussed in (Galliers93), NLP is an area in which existing systems are seldom suitable for immediate use in a new application area. Any new use will involve more or less customisation of a generic system, for example by lexicon and grammar modification. How then is it possible to evaluate the relative suitability of different generic systems for a task?
Isolating the system's task for evaluation can also be difficult for NLP systems that provide interactive, partial support for a user task, such as grammar checkers in the task of proof-reading, because the most valid measurable results (e.g., the quality of the output text) apply only to a setup that combines user and system performance. System performance can only be evaluated by factoring out the effect of different end-user types on the overall performance as a separate performance factor. Relative to this independent variable, more detailed, measurable attributes of the system's contribution can then be defined.
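The factoring described above can be illustrated with a small sketch: end-user type is treated as an independent variable, and systems are compared only within each level of it. The user types, system names and quality scores below are purely hypothetical, invented for illustration.

```python
from statistics import mean

# Hypothetical output-quality scores for (end-user type, system) setups;
# all names and values are illustrative, not drawn from any real evaluation.
scores = {
    ("novice", "checker_A"): [0.61, 0.58, 0.64],
    ("novice", "checker_B"): [0.70, 0.66, 0.69],
    ("expert", "checker_A"): [0.85, 0.88, 0.83],
    ("expert", "checker_B"): [0.84, 0.86, 0.87],
}

def compare_within_user_type(scores):
    """Hold end-user type constant and compare systems only within
    each level of that independent variable."""
    by_user = {}
    for (user, system), vals in scores.items():
        by_user.setdefault(user, {})[system] = mean(vals)
    return by_user

for user, results in compare_within_user_type(scores).items():
    best = max(results, key=results.get)
    print(user, results, "best:", best)
```

The point of the grouping is that a raw comparison of systems across all users would confound the system's contribution with the user's; within-group comparison isolates the former relative to the latter.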
Transforming requirements from problem domain terms into reportable attributes can be seen as a type of redescription problem, constrained by the necessity that redescriptions be valid in some sense. In terms of the EAGLES evaluation framework, we need to know where a certain reportable attribute requirement comes from in terms of the actual problem domain, and whether the derivation is valid; information supporting the choice of the attributes used should ideally be part of evaluation documentation, just as much as information about the methods used to arrive at values for the attributes. For requirements engineering in general, such validation is required to show that specifications are equivalent in some sense to the problem domain requirements they are intended to express -- that they are in a sense alternative descriptions of `the same thing'. There is considerable work in Requirements Engineering on the traceability of requirements in this process of redescription or transformation (Gotel93) and this may be of use in the future to our evaluation work.
However, unlike software design, where the ostensible aim is to produce a design that is fully equivalent to the requirements, part of the purpose of evaluation is to point out informatively where designs fail to fulfil requirements. As an example, no spelling checker of the normal type can correct errors stemming from simple typing mistakes which result in legal though unintended words, such as typing `form' instead of `from'; yet a realistic problem domain requirement would certainly view this as a spelling error. The discrepancy arises at the point of defining the idea of a spelling error at the attribute level as a `best-equivalent' to its definition at the problem domain level; an explicit and structured process of decomposing or transforming problem domain requirements into measurable attributes provides opportunities for noting where and how such discrepancies occur. While it might be excessive to devote a whole attribute to reporting this failure for every system under test, such discrepancies between a problem domain requirement (correct typos) and the available means to satisfy it should be included in any accompanying discussion or guide to the use of the attribute grid.
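The limitation of conventional spelling checkers mentioned above can be made concrete with a minimal sketch: a checker of this type flags only tokens absent from its lexicon, so a typing error that produces a legal word passes silently. The word list here is a tiny stand-in for a real lexicon.

```python
# Minimal sketch of a conventional dictionary-based spelling checker;
# the word list is an illustrative stand-in for a real lexicon.
LEXICON = {"i", "sent", "the", "form", "from", "letter", "yesterday"}

def flagged_words(text):
    """Return tokens not found in the lexicon -- the only errors a
    checker of this type can detect."""
    return [w for w in text.lower().split() if w not in LEXICON]

# A transposition yielding a legal word goes unflagged...
print(flagged_words("i sent the letter form yesterday"))  # -> []
# ...whereas a non-word is caught:
print(flagged_words("i snet the letter yesterday"))       # -> ['snet']
```

Detecting the first error would require context beyond the word itself, which is precisely the gap between the problem domain requirement and what the measurable attribute can capture.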
Ascertaining which factors are truly valid determinants of software quality is a similar problem; there needs to be a program of validation of requirements as well as validation of how well the reportable attributes and the measures and methods associated with them reflect requirements. This process cannot be completely codified, and must to a large extent be driven by open-ended evaluation in realistic situations, feeding back unforeseeable insights into the requirements statement. However, there are a number of knowledge acquisition methods which provide useful frameworks for the generation of descriptions of the setups, and hypotheses about which factors in the setup may be performance factors. These must then be tested by holding other factors constant; the combinatorial testing task may be considerable, but is the only way to isolate the effect of the system under test.
The next section looks at some knowledge acquisition methods that may be relevant to LE evaluation.