Requirements analysis for evaluation aims to provide a description of the problem domain requirements. This description must support the subsequent development of detailed reportable attributes for systems under test. The requirements part proper should consist of:
A requirements analysis procedure for language engineering evaluation aims to provide guidelines for the construction of these outputs and, finally, of the reportable attributes themselves. Such guidelines will never amount to a deterministic procedure. Instead, they should form a framework which can support the activity of requirements analysis by making available:
An important part of the usefulness of generic or pre-existing requirements elements is the fact that, having been used in previous evaluations, they are associated with elements of the library of available test types.
What follows is based on the example of spelling checkers, partly to illustrate the usefulness of reusing and adapting pre-existing requirements definitions by comparison with the Writers' Aids subgroup's work on grammar checkers (Appendix Evaluation of Writers' Aids); a more detailed treatment is given in Appendix Requirements Analysis for Linguistic Engineering Evaluation. The work here is largely confined to the issue of adequacy evaluation.
The basic steps are:
At each level, the potential for reuse and adaptation will be looked at.
The first stage addresses the highest level of the requirements analysis. It is in terms of this level that other levels of analysis, down to the reportable attributes which actually express measurements of systems, must be validated.
The basic tasks the system is required to address are functional requirements, which at the top level of description should relate as closely as possible to valid and measurable user requirements. For tasks which are essentially document transformation filters, such as spell checking, this is relatively straightforward, since the state of the document before the filter is a given (dependent on the setup), and the required state of the document after can usually be determined by analogy with human processes. The system under test can be evaluated, on this highest level, in terms of comparisons between two document types, the input and output to the process illustrated in Figure 2.3.
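As a minimal sketch of this black-box view, the before and after documents can be compared against the intended reference text; the function name and sample sentences below are invented for illustration, not taken from any actual test material:

```python
# Minimal sketch of top-level, black-box evaluation of a document
# transformation filter: compare the document before and after the
# process against the intended (reference) text. The sample texts
# are illustrative assumptions.

def residual_errors(text, reference):
    """Count token positions where the text differs from the reference."""
    return sum(1 for t, r in zip(text.split(), reference.split()) if t != r)

reference = "the quick brown fox jumps over the lazy dog"
before    = "the qiuck brown fox jmups over the lazy dog"   # two errors
after     = "the quick brown fox jmups over the lazy dog"   # one corrected

print(residual_errors(before, reference))  # 2
print(residual_errors(after, reference))   # 1
```

The comparison makes no assumption about how the transformation is achieved, which matches the task-level abstraction described above.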
Figure 2.3: Top level data flows and agents in spell checking task.
The next step is to construct a set of relevant setups, identifying situational and environmental variables that affect the requirements for the task under consideration. This includes the gathering of possibly disjoint sets of requirements from different sources, as the consumer report paradigm allows.
Questions that are relevant for the analysis of a setup include the definition of the upstream and downstream paths of information to and from the document types that form the scope of the top-level task evaluation. For example, if the text is sent to an optical character recognition device after being written, this should be noted, and a new role node inserted into the process diagram. Such nodes are used to structure the identification of variables that are relevant to task performance and facilitate the modularisation of requirements. For instance, the errors present in the text before proofing will be affected independently by variables associated with the writer role (e.g., first language and language of the text) and variables associated with the OCR role. Other relevant elements of the setup include computational and organisational settings.
Libraries of previously used setup elements, with associated quality requirements, would facilitate reuse. For instance, once the kind of spelling errors associated with OCR use have been determined, requirements stemming from these can be modularly combined with new requirements for transfer-based errors for new language pairs.
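Such modular combination might be sketched as follows; the module contents and requirement descriptions are invented placeholders, not drawn from any actual library:

```python
# Sketch of combining setup modules from a library of previously used
# setup elements, each carrying its associated quality requirements.
# The module names and requirement strings are illustrative assumptions.

ocr_module = {
    "setup": "OCR input",
    "requirements": ["detect character-confusion errors (e.g. rn -> m)"],
}
translation_module = {
    "setup": "translated text (new language pair)",
    "requirements": ["detect transfer-based spelling errors"],
}

def combine(*modules):
    """Union the quality requirements of several setup modules."""
    reqs = []
    for m in modules:
        reqs.extend(m["requirements"])
    return reqs

combined = combine(ocr_module, translation_module)
print(combined)
```

The point of the sketch is only that requirements gathered for one setup element can be reused unchanged when that element recurs in a new evaluation.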
The validity of subsequent evaluation processes depends on the validity of the methods used to analyse requirements. Language engineering evaluation has special interest in the analysis of texts. Sources of representative texts need to be found; reliable experts need to analyse them; or reliable and applicable prior research must be found that characterises the relevant document types. The development of well-documented collections of representative and realistic text for a wide range of requirements is necessary.
Note that this level of analysis presupposes nothing about the way the transformation or filter is to be accomplished. At this level of abstraction, we can define some quality requirements of the task at the domain level. These will form the basis of more detailed requirements at the reportable attribute level.
Functional requirements can often be defined in terms of classic recall and precision measurements: does the system do all and only what it should? For error checking systems, such requirements are based on a count of errors in the `before' and `after' texts. (The editor role in this process can be thought of either as a human editor in a situation before introduction of any computational tool, or as the combined role of the checking phase carried out by human and software.) Non-functional requirements at the top level might relate to the speed of the overall process of checking a document of a given sort. More detailed functional and non-functional requirements are to be found at the next level of analysis.
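A hedged sketch of how such recall and precision figures could be computed from error counts; the token positions used here are invented test data, not drawn from any real text:

```python
# Recall and precision for an error-checking system, based on counts
# of errors in the `before' text and of positions flagged by the
# system. Positions are token indices; the data is illustrative.

true_errors  = {1, 4, 7}    # token positions that are genuinely wrong
system_flags = {1, 4, 5}    # positions the checker flagged

true_positives = true_errors & system_flags
recall    = len(true_positives) / len(true_errors)    # does it find all errors?
precision = len(true_positives) / len(system_flags)   # does it flag only errors?

print(recall, precision)  # both 2/3 in this invented example
```

Recall answers "does the system do all it should?" and precision "does it do only what it should?", as in the classic formulation above.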
After this top level definition of quality, we need to turn to a consideration of the systems we are interested in evaluating. Figure 2.4 shows the place of the systems under consideration in the new task model.
Figure 2.4: Introduction of computational system.
This task model has the basic structure of a particular sub-type of text transformation systems, namely interactive or computer-assisted text transformation systems. At this level of analysis, another set of generic quality questions becomes available. The role of the human editor, and the relations between the advice from the system and that editor, become available for analysis and the definition of quality requirements. Further knowledge acquisition is required to determine the possible variable elements in different types of human editor in terms of what kind of advice is useful. At this level of analysis too, all sorts of non-functional quality characteristics of the system become relevant, from usability to compatibility with existing software environments. Questions prompting for requirements for these quality characteristics are then associated with the task model to facilitate requirements building.
Up to now, the requirements analysis has been entirely top-down. To decompose the basic recall and precision functionality requirement into useful sub-attributes, we need to take a partly bottom-up approach based on the categories that are relevant to system performance. This must rest on some prior experience of the kinds of system under consideration, and hence is likely to improve with repeated open-ended evaluation. For instance, it is only because we know something about the operation and limitations of spelling checkers that we might have a separate sub-attribute for their coverage of multi-word elements like ad hoc.
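The decomposition of an overall recall figure into category-based sub-attributes might look like this in outline; the categories and detection outcomes are illustrative assumptions based on the spelling-checker example:

```python
# Sketch of decomposing overall recall into sub-attributes by error
# category (e.g. a separate figure for multi-word elements such as
# 'ad hoc'). The category labels and outcomes are invented test data.

from collections import defaultdict

# (category, detected?) for each error in the test material
errors = [
    ("single-word typo", True),
    ("single-word typo", True),
    ("multi-word element", False),   # e.g. 'ad hoc' not covered
    ("multi-word element", True),
]

def recall_by_category(errors):
    found = defaultdict(int)
    total = defaultdict(int)
    for cat, detected in errors:
        total[cat] += 1
        found[cat] += detected
    return {cat: found[cat] / total[cat] for cat in total}

print(recall_by_category(errors))
```

A uniform overall recall of 0.75 here would hide the fact that coverage of multi-word elements is only 0.5, which is exactly the information the sub-attributes are meant to surface.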
Unlike software design, where the ostensible aim is to produce a design that is fully equivalent to the requirements, part of the purpose of evaluation is to point out informatively where designs fail to fulfil requirements. As an example, no spelling checker of the normal type can correct errors stemming from simple typing mistakes which result in legal though unintended words, such as typing form instead of from; yet a realistic problem domain requirement might well count this as a spelling error. The discrepancy arises at the point of defining the idea of a spelling error at the attribute level as a `best equivalent' to its definition at the problem domain level; an explicit and structured process of decomposing or transforming problem domain requirements into measurable attributes provides opportunities for noting where and how such discrepancies occur. While it might not be advisable to devote a whole attribute to reporting this failure for every system under test, such discrepancies between a problem domain requirement (`correct typos') and the available means of satisfying it should be included in any accompanying discussion or guide to the use of the attribute grid.
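The form/from example can be made concrete with a toy dictionary-lookup checker; the tiny lexicon below is an invented stand-in for a real word list:

```python
# Illustration of why a plain dictionary-lookup checker cannot catch
# real-word errors such as 'form' typed for 'from'. The lexicon is an
# invented stand-in for a real word list.

lexicon = {"i", "sent", "the", "letter", "from", "form", "office"}

def flag_nonwords(text):
    """Flag only tokens absent from the lexicon."""
    return [w for w in text.lower().split() if w not in lexicon]

# 'form' is a legal word, so the unintended substitution goes unflagged,
# while an ordinary non-word typo would be caught.
print(flag_nonwords("i sent the letter form the office"))  # []
print(flag_nonwords("i sent the leter from the office"))   # ['leter']
```

This is precisely the discrepancy between the problem domain requirement (`correct typos') and what a word-by-word lookup can in principle measure.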