There are a number of issues important for NLP evaluation which any requirements analysis method must take into account.
Adequacy evaluation typically deals with the comparison of multiple, already existing software systems. The scope of the problem dealt with by each system may be different in each case, hence it may be difficult to evaluate systems against one another. Not only that, but where a set of products does appear to converge on a definition of functionality, there is no guarantee that this definition matches user and task-based requirements; these have to be established independently by the evaluator. A further complication for this type of evaluation, taking it further from the type of requirements analysis suitable for design, is that the introduction of a software system inevitably changes the tasks that preceded it.
Perhaps the most serious problem here, however, is that the needs of the user must be taken to be implied, since they are certainly not stated. It is here that the requirements analysis process for adequacy evaluation differs most from that for progress evaluation, or for any evaluation where there is no extensionally identified user group: realistic requirements analysis for design usually requires dialogue, prototypes and mockups to iteratively develop and test a description of the user's needs.
The main purpose of a requirements statement from the point of view of software development is to support the subsequent design and implementation phases. When our aim is evaluation, this is no longer a priority. A new end-audience for the statement is important, however: the customer of the evaluation, the person or organisation from whose point of view the evaluation is made. If we are to take seriously the needs of the customer of the evaluation, we need to analyse their requirements for information (including the form of the information) as much as the technical requirements. For the consumer report style of evaluation, the presentation will have to accommodate multiple viewpoints of this sort.
It should go without saying that no meaningful adequacy evaluation can take place without a good understanding of the contexts or setups (human, organisational and computational) into which a system must fit. Ascertaining what kinds of setup are relevant to a particular evaluation, and which contextual factors affect quality requirements, is crucial to the requirements analysis process for all kinds of evaluation.
In the interests of software reuse and comprehensibility, a component-based approach is popular in both software engineering and LE. Various types of progress evaluation rely on the ability to devise a modular breakdown of the overall functionality such that individual components can be specified, evaluated and chosen independently, or designed, implemented and evaluated independently. However, it can be difficult to compare components against one another when they do not fit in the same way, requiring different setups in order to work at all.
For some NLP systems which provide interactive and partial support for some user task, such as grammar checkers in the task of proof-reading, it can be difficult to find a way of isolating the system's task for evaluation, because the most valid measurable results (e.g., the quality of the output text) are at a level that combines user and system performance. Analysis of the context of use, in terms of different user contributions to and constraints on performance, is then necessary.
These are in a sense the same problem, since they are about the difficulty of constructing a level ground for evaluating systems with different scopes of operation.
As discussed in (Galliers93), NLP is an area in which existing systems are seldom suitable for immediate use in a new application area. Any new use will involve more or less customisation of a generic system, for example by lexicon and grammar modification. How then can the relative suitability of different systems for a task be evaluated? This is somewhat different from the issue of evaluating component-based systems, although clearly related. It may be necessary to take into account the ease of customisation of particular generic systems, to the extent that customisability becomes a major functional requirement in its own right rather than being subsumed under the `modifiability' quality characteristic.
The stage of requirements definition in a software project is never as self-contained as implied by the Figure. The V-diagram should only be taken as indicative of the relationships between the results of the various analysis and design, implementation and evaluation stages. The actual processes by which these results are arrived at are more likely to follow an iterative procedure, such as that formalised in Boehm's spiral model of software development (Boehm88), and this is true of pure evaluation too. Ascertaining which factors are truly valid determinants of software quality is a similar problem. There needs to be a programme of validation of requirements, as well as validation of how well the reportable attributes, and the measures and methods associated with them, reflect requirements. This process cannot be codified and must to a large extent be driven by open-ended evaluation in realistic situations, feeding back unforeseeable insights into the requirements statement.
How wide should the bounds of the evaluation be set? What is the scope of the requirements? We have already seen that, when we are comparing a number of existing systems, they may not cover the same ground. When a software system is placed within a human process, as with many interactive systems, it may be relevant to evaluate the human plus software system as a whole, or at least to aim at validating and ranking requirements of software quality in terms of their correlations with the performance of the system as a whole. For instance, in the case of spelling checkers, it is clearly relevant to their impact on the overall task of document quality assurance to know that the errors spelling checkers are able to spot (namely, errors that do not result in other legal but unintended words) are also those that people find easiest to spot.
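The spelling-checker point can be made concrete with a minimal sketch of a dictionary-based checker; the word list and sentences here are invented purely for illustration. The checker flags non-word errors, but a real-word error (a legal but unintended word) passes silently, which is exactly the class of error that people, too, find hardest to spot.

```python
# Illustrative sketch only: a toy dictionary-based spelling checker.
# The word list below is an invented example, not a real lexicon.
DICTIONARY = {"the", "cat", "sat", "on", "mat", "fell", "from", "form"}

def flag_errors(text):
    """Return the tokens not found in the dictionary (non-word errors)."""
    return [w for w in text.lower().split() if w not in DICTIONARY]

# A non-word error ("teh") is caught:
print(flag_errors("teh cat sat on the mat"))        # ['teh']

# A real-word error ("form" intended as "from") is silently accepted:
print(flag_errors("the cat fell form the mat"))     # []
```

The second call returns an empty list because every token is a legal word, so the checker contributes nothing to spotting the error; evaluating the checker in isolation would miss this limit on its contribution to overall document quality.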
New specialist research in requirements engineering is making common ground with knowledge acquisition techniques developed in Knowledge Based Systems/Expert Systems work. This addresses the problem of getting domain expertise into representations usable for subsequent design (or in our case, evaluation). NLP has its own methods, particularly for actual language analysis, but better use could be made of more general methods, particularly where whole systems and not just NLP components are being evaluated.