The detailed requirements outlined in section Requirements provide the backbone of the methods for obtaining values for reportable attributes relating to part I of the functionality requirement, namely positive coverage of errors. We illustrated those discussions with the example of English determiner-noun number disagreement, and we continue with that example here.
Importantly, what was missing from those discussions was any indication of how the error would be represented concretely in terms of examples. Now that we come to test for coverage of that error, we must give a more detailed description of the model of proofed text, which was introduced as an implicit standard relative to which errors are defined.
This part of the functionality requirement deals with positive coverage of errors -- recall, in information retrieval terms. Our recommended method for measuring this is to associate with each error type a set of error examples. An error example consists of:
The main requirement, clearly, is to generate a wide range of instantiations of the error type and its correction, varying in grammatical construction and complexity and in lexical choice.
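To make this concrete, the following is a minimal sketch, in Python, of one way such error/correction pairs might be enumerated for the determiner-noun disagreement example. The sentence templates, the lexical pairs and the determiner-swapping transformation are illustrative assumptions of ours, not part of the method as specified; a real test set would draw on far wider variation in construction and lexis.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

@dataclass
class ErrorExample:
    error_text: str       # text containing the error
    corrected_text: str   # the proofed (corrected) counterpart

# Corrected-sentence templates, varying grammatical construction (assumed here).
TEMPLATES = [
    "I bought {det} {noun} yesterday.",
    "We noticed {det} {noun} near the door.",
    "She said that {det} {noun} had been repaired.",
]

# Determiner/noun pairs that agree in number in the corrected text.
AGREEING_PAIRS = [("this", "book"), ("these", "books"),
                  ("that", "car"), ("those", "cars")]

# Swapping singular and plural determiners introduces the number disagreement.
SWAP_DETERMINER = {"this": "these", "these": "this",
                   "that": "those", "those": "that"}

def generate_examples() -> List[ErrorExample]:
    """Instantiate the error type and its correction across templates and lexis."""
    examples = []
    for template, (det, noun) in product(TEMPLATES, AGREEING_PAIRS):
        corrected = template.format(det=det, noun=noun)
        error = template.format(det=SWAP_DETERMINER[det], noun=noun)
        examples.append(ErrorExample(error_text=error, corrected_text=corrected))
    return examples

if __name__ == "__main__":
    for ex in generate_examples()[:4]:
        print(ex.error_text, "->", ex.corrected_text)
```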
In fact, we work `back to front'. The corrected text examples can be generated using any test suite generator for NLP comprehension (cf. TSNLP work), or from suitable corpora, and so we start from there. We stated earlier, in our task model discussion, that error types should be defined in terms of transformations of proofed text examples -- the examples given there are repeated here:
This technique is purely black-box, relying on a model of the errors and corrections that is based purely on task analysis. There is a further set of techniques, commonly used for testing grammar checkers, that involves guessing plausible weaknesses in checker behaviour; this is closer to glass-box testing, since it relies on some idea of the models of text used by the checkers. We do not deal with this at the moment, and it is possible that sufficiently wide black-box testing will give the same results. However, given the difficulties in automating the comparison of the advice given by a system with the advice needed by an end-user, it may be that truly huge test sets are much more difficult to administer for grammar checkers than, for example, for spelling checkers.
The method for using the error examples is relatively obvious: the error sentences are presented to the system, and we have to judge whether the system response would allow an end-user to produce the corrected sentence; as discussed previously, the sometimes vague nature of the advice given complicates this. Thus the error examples we have defined do not quite constitute a direct (`by inspection') method. The method will award a score to each response, with high scores for very specific correct advice graded through to very low scores for no advice or misleading advice. (For this part of the functionality requirement we do not penalise misleading advice more heavily than no advice; that will be done in the testing for the second part, and the results combined as discussed below.)
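Purely to illustrate the shape of such a scoring method, a small sketch follows. The response categories, the score bands and the `SystemResponse' fields are assumptions made for the sake of the example, and the exact-string comparison used here to stand in for the human judgement deliberately glosses over the difficulty, noted above, of assessing vague advice automatically.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Advice(Enum):
    SPECIFIC_CORRECT = "specific_correct"  # flags the error and offers the right fix
    FLAGGED_ONLY = "flagged_only"          # flags the error, vague or no suggestion
    NO_ADVICE = "no_advice"                # error not flagged at all
    MISLEADING = "misleading"              # flags something, but the suggestion is wrong

@dataclass
class SystemResponse:
    flagged: bool
    suggestion: Optional[str]  # replacement text offered by the checker, if any

def judge(response: SystemResponse, corrected_text: str) -> Advice:
    """Judge whether the response would let an end-user reach the correction.

    A crude stand-in for the human judgement discussed in the text."""
    if not response.flagged:
        return Advice.NO_ADVICE
    if response.suggestion is None:
        return Advice.FLAGGED_ONLY
    if response.suggestion.strip() == corrected_text.strip():
        return Advice.SPECIFIC_CORRECT
    return Advice.MISLEADING

# Graded scores: high for very specific correct advice, down to very low for no
# advice or misleading advice. For this part of the functionality requirement,
# misleading advice is not penalised more heavily than no advice.
SCORES = {
    Advice.SPECIFIC_CORRECT: 1.0,
    Advice.FLAGGED_ONLY: 0.5,
    Advice.NO_ADVICE: 0.0,
    Advice.MISLEADING: 0.0,
}

def score(response: SystemResponse, corrected_text: str) -> float:
    return SCORES[judge(response, corrected_text)]
```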
Building on these scores, we then need a way to combine the individual test scores into aggregate scores for the error type as a whole, or indeed for groups of error types where that is the level at which reportable attributes are defined. The weighting functions and the combination procedure constitute an interpretation scheme for the method, which is based on: