The task addressed in this section is the choice of reportable attributes that will form columns of a Consumer Report. Each of the quality characteristics mentioned in earlier sections will give rise to a number of such reportable attributes.
The attributes we seek to identify are descriptions of a system's performance in a given setup which allow the customer (the user of the evaluation) to make the decisions they require about a set of systems in a class. We have just seen how a combination of the requirements and task model can be used to give detailed requirements statements that can be based on measurement. Our task here is to decide what to consider when reporting the results of these tests to the user.
The choice of reportable attributes is driven by considering the requirements of the customer -- the user of the evaluation. In general, the customer and the end-user (the user of the system) may have different requirements, and they play distinct roles in the design of the evaluation. The customer's requirements may be based on the impact of the whole end-user/system setup that is being evaluated on wider tasks in a wider corporate setup. In the case of grammar checkers (which are relatively mass-market products commonly used by non-corporate writers) the customer and the end-user may often be the same person, but in principle the end-user who actually interacts with the system is a variable in the setup the customer is evaluating. Thus, arriving at a list of reportable attributes involves an analysis of the interaction of the product and the end-user in the immediate task, and of the place of this task and its outputs in the wider task that may interest the customer. We have already outlined a model of the immediate task, which is relevant both to this choice of attributes and to the development of test methods for the functional aspects of the systems. In this section, a further model, of customer types, is in theory called for.
In the case of grammar checkers, we feel that customer and end-user perspectives are not likely to differ significantly. This is particularly true of functionality, the quality characteristic we are concentrating on in this report. Different characteristics, however, will be relevant to each role, and so we continue here as if our end-user model has no relation to our new customer model.
The customer must be analysed to find out how the functionality results ought to be reported, in terms of choice of reportable attributes, and perhaps what supplementary material, such as explanations of attribute choice and methods or case studies of typical users, should be supplied. Choice of attributes comes first, and derives from the researcher's knowledge about the system's performance on the task and the customer's requirements; how to describe them is secondary.
However, it should be remembered that the Consumer Report is a multi-user artefact, in that it contains information relevant to a number of different customer classes, and indeed end-user and writer classes in which customers may be interested. It relies on self-diagnosis by the customer in choosing which attributes to pay attention to, and how to combine the values for those attributes to form their own evaluation. Thus, our aim is to choose attributes that will support most evaluations of grammar checkers. In terms of the functionality aspects we are concentrating on, this means producing information that can be interpreted for a number of different end-user/writer types.
The customer can be expected to know and construct their own comparative evaluation on certain aspects of end-user, writer, and system compatibility, such as the target language of the system and the language in which its advice is couched. Accordingly, such system characteristics will be included as straightforward attributes that will appear in the report.
We have motivated the production of a taxonomy of errors that is at least notionally couched in terms close to the writer's sources of error. However, it was also designed at quite a detailed level to support the development of test material, and so may be too detailed to be used as a set of attributes for reporting functionality to the customer. We may want to report functionality performance for particular types of error individually -- especially errors whose significance varies with writer or end-user classes. For other types, we may want to group a number together under a group attribute name. Our emphasis here is on NLP evaluation, and it is relatively clear that NLP, broadly conceived, is central to the functionality quality characteristic of grammar checkers. Given the mass-market nature of such tools, it is also clear that, whether or not the customer is the same as the end-user, we will probably need to put a good deal of effort into finding ways of presenting and explaining the results of functionality testing in terms of attributes that are meaningful to the customer. Standard sets of examples and illustrations, and particularly case studies that allow a customer to diagnose their own requirements, should be developed as part of our future work, and by anyone building on it, for example for a new language.
Thus, we have attributes based on error types and groups of error types. When we start to think about what sort of measure to use to convey performance on these errors, we encounter some complications that might lead us to convey some of the information not as measures (because they would be too complicated, or have too many dimensions) but as separate attributes, which might be grouped into aggregates or replicated individually for each error type attribute.
One such measure relates to precision/recall, which is a key property of this application, as in information retrieval, but here is complicated by the range of advice types. It is almost certainly not valuable to many customers to present measurements of recall separately from some consideration of precision: saying that a given system catches all instances of it's/its confusion, when in fact it flags every occurrence of either form, is likely to be misleading, even if the precision figures are given somewhere else. However, different end-users have different requirements, and different interfaces may change the effect, as may customisability. An end-user who needs a lot of tutorial advice will not benefit from a service that combines high recall with low precision and poor advice types, while they might be able to use the same recall and precision if the advice included enough information to let them make an accurate diagnosis.
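The it's/its case can be made concrete with a small calculation. The following is a minimal sketch only: the Instance record, the precision_recall function and the test suite of ten occurrences (three of them genuine errors) are all hypothetical, invented for illustration, and do not come from any test material described in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One occurrence of "its" or "it's" in a hypothetical test suite."""
    is_error: bool  # gold annotation: the writer chose the wrong form here
    flagged: bool   # the checker raised a flag at this occurrence


def precision_recall(instances):
    """Return (precision, recall); a value is None when it is undefined."""
    true_pos = sum(1 for i in instances if i.is_error and i.flagged)
    flagged = sum(1 for i in instances if i.flagged)
    errors = sum(1 for i in instances if i.is_error)
    precision = true_pos / flagged if flagged else None
    recall = true_pos / errors if errors else None
    return precision, recall


# Assumed data: 10 occurrences, the first 3 are real errors,
# and the checker flags every single occurrence.
flag_everything = [Instance(is_error=(n < 3), flagged=True) for n in range(10)]

precision, recall = precision_recall(flag_everything)
print(f"precision={precision}, recall={recall}")  # precision=0.3, recall=1.0
```

On these assumed figures the system "succeeds in all instances" (recall 1.0) while seven of its ten flags are false alarms, which is why reporting recall alone would mislead.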
Another such measure relates to the type of coverage a system has for a particular error type. Some systems may find only easily identified instances of an error type, but do that reliably. This may be of use to some users and not to others, who only make more complicated versions of the error. A measure that simply aggregates results of different levels of difficulty, with whatever weighting function, will not be able to support customer self-diagnosis on this dimension.
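The loss of information under aggregation can be illustrated as follows. This is a sketch under stated assumptions: the two difficulty levels ("easy" and "hard"), the function names and the counts are hypothetical, and an unweighted aggregate is used purely for simplicity; the same point holds for any fixed weighting function.

```python
from collections import defaultdict


def recall_by_level(results):
    """results: (difficulty_level, detected) pairs, one per seeded error."""
    found = defaultdict(int)
    total = defaultdict(int)
    for level, detected in results:
        total[level] += 1
        found[level] += int(detected)
    return {level: found[level] / total[level] for level in total}


def aggregate_recall(results):
    """Unweighted recall over all seeded errors, ignoring difficulty."""
    results = list(results)
    return sum(int(detected) for _, detected in results) / len(results)


# Assumed data: the system finds every easy instance and no hard one.
results = [("easy", True)] * 8 + [("hard", False)] * 12

per_level = recall_by_level(results)
overall = aggregate_recall(results)
print(per_level)  # {'easy': 1.0, 'hard': 0.0}
print(overall)    # 0.4
```

A customer whose writers make only the complicated versions of the error cannot tell from the single figure 0.4 that the system would be of no use to them, whereas the per-level figures support exactly that self-diagnosis.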
There is a case, therefore, for having separate attributes that record false positives and coverage variability.
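One way of picturing such separate attributes is as distinct columns in a row of the report. The sketch below is illustrative only: the ReportRow record, its field names, the system name and all the values are hypothetical, and the false-alarm share is assumed here to mean the proportion of the system's flags that fall on correct text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReportRow:
    """Hypothetical Consumer Report row for one system and one error attribute."""
    system: str
    error_attribute: str
    recall: float
    false_alarm_share: float             # assumed: flags on correct text / all flags
    coverage_by_level: Dict[str, float]  # assumed: recall per difficulty level


def format_row(row):
    """Render the row with recall, false alarms and coverage kept apart."""
    levels = ", ".join(
        f"{level}={value:.2f}"
        for level, value in sorted(row.coverage_by_level.items())
    )
    return (
        f"{row.system} | {row.error_attribute} | recall={row.recall:.2f} | "
        f"false_alarms={row.false_alarm_share:.2f} | coverage: {levels}"
    )


# Illustrative values only; they are not measurements of any real product.
row = ReportRow(
    system="System A",
    error_attribute="it's/its confusion",
    recall=0.40,
    false_alarm_share=0.70,
    coverage_by_level={"easy": 1.0, "hard": 0.0},
)
print(format_row(row))
```

Keeping the three values in separate columns leaves the combination to the customer, in line with the self-diagnosis on which the report relies.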