This section describes three different reviews of grammar checkers. The three reviews serve as a starting point for the development of the quality characteristics specific to grammar checkers which are presented later in section Quality characteristics.
This report (Douglas90) describes the capabilities of some of the most common automated writing aids, primarily from a linguistic perspective, but also dealing with user-friendliness in terms of the user interface and the degree of customisability offered. The evaluation was carried out in 1990 on a set of six then commercially available grammar and style checking programs (Correct Grammar, Grammatik, MacProof, RightWriter, Sensible Grammar, and StyleWriter). The evaluation had two aims: firstly, assessing the adequacy of the systems, to ascertain the commercial state of the art, and secondly, assessing the kinds of technology used and how these related to the adequacy displayed by each system. The second aim was intended to provide input to the design and development of the automated writing aid The Editor's Assistant, at the University of Edinburgh.
The two-fold aims of the investigation are mirrored in two linked taxonomies. The first taxonomy, the user centred taxonomy, provides a list of error types, intended to provide a system of categories by which the errors detected by systems would be classified so as to correspond to the user's conceptual model of the domain of activity. The second taxonomy, the system centred taxonomy, aims to present a characterisation of possible underlying mechanisms used by different systems, on a number of dimensions. The two taxonomies are linked by the fact that detecting and correcting particular errors (or examples of errors) was deemed to require a certain level of linguistic technology. The aim of the evaluation was to use the user centred taxonomy to position the system under test within the system centred taxonomy, i.e., to determine, from the results of a set of tests based on the user centred taxonomy, what level of technology the system under test utilised, and thus its potential range of operation. This is a sort of general diagnostic evaluation, not unrelated to the idea of reverse engineering.
The dimensions of the system centred taxonomy were:
The numeric results assigned to dimensions (1) and (2) were intended to reflect the overall potential capability of the system's underlying technology to detect and respond to grammatical errors. The division into how the text is apparently represented and what kind of patterns the error rules can find in the text was intended to clarify the level of linguistic sophistication of the underlying methods used by the various systems. These dimensions, and to some extent the third dimension (which reflects whether responses from the system seem to be all canned text or whether they have variable pattern elements), form a group of properly linguistic objects of evaluation.
The user centred taxonomy is divided into four top-level sections, grammar, punctuation, style and usage, of which only the first is further considered here. Grammar errors are subdivided into agreement and inflection errors, errors in the use of relative and reflexive pronouns, errors in comparatives and correlatives, problems with negation, and numerous heterogeneous categories. Punctuation errors are considered according to the following classification of punctuation functions: parenthetic marks, sentence terminators, sentence punctuators, spacing and hyphenation. Moreover, problems concerning how these can appear in combination are covered.
Concerning style, the aim is to write clear, direct and easily comprehensible English. Among the style errors, the use of the passive voice, nominalisations and the use of over-long or complex sentences are mentioned.
Typical usage errors are capitalisation and double word errors. Other errors arise from confusion between words, for example because of phonetic similarity. Still others concern the appropriate use of language in terms of genre or dialect, such as British English vs. American English and formal vs. informal style. In addition, since not all documents will have the same requirements, house style is mentioned as a way of imposing consistency over a wide range of characteristics of text.
Associated with each error type are examples and requirements. The latter are intended to reflect what underlying linguistic sophistication a system would need to successfully detect the error in question and hence they provide a link from the user centred taxonomy to the system centred taxonomy.
The values assigned on the dimensions of assessment derive from the results of applying a test suite based on the user centred taxonomy. The test suite was composed of an extension of the examples in the user centred taxonomy. Each error type has its corresponding set of examples in the test suite. Test examples are constructed in both positive and negative modes, that is, where an error exists in the construction and where one does not.
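The link from test results to a position in the system centred taxonomy can be illustrated with a minimal sketch. It assumes that each error type carries a single numeric requirement level and that a system is placed at the highest level among the error types it detects; both the numeric levels and this inference rule are illustrative assumptions, not the scheme used in (Douglas90). The error type names and values in `REQUIREMENTS` are hypothetical.

```python
# Hypothetical requirement levels: a higher number stands for more
# linguistic sophistication needed to detect the error type.
REQUIREMENTS = {
    "double word": 1,
    "capitalisation": 1,
    "subject-verb agreement": 3,
    "relative pronoun": 3,
}


def infer_level(requirements, detected):
    """Return the highest requirement level among detected error types.

    Assumed rule: a system that detects an error type must possess at
    least the technology that error type requires. Returns 0 when no
    error type was detected.
    """
    return max((requirements[error_type] for error_type in detected), default=0)
```

For instance, a system that only detects double word and capitalisation errors would be placed at level 1 under these assumptions, whereas one that also detects agreement errors would be placed at level 3.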
The set of possible outcomes for each example reflected the dual aim of testing for positive and negative success. For an example marked with an asterisk, to denote the fact that it contains an error, there are five possible outcomes, described in table D.1.1.1.5.
| Outcome | Explanation |
| | response identifies error clearly |
| | system fails to respond to error |
| | the error could be diagnosed from the response, especially if it is an error of execution (i.e., the user only needs his attention drawn to it to recognise it) |
| | there is a response, but sufficiently indirect to make error diagnosis difficult |
| | response completely unrelated to error (i.e., a false positive showing up in part of the suite not specifically designed to trap it) |
For an example not marked with an asterisk, the outcomes shown in table D.1.1.1.5 are possible.
| Outcome | Explanation |
| | correctly ignores the correct text |
| | falls into false positive trap |
| ! | particularly awful false positive |
Apart from the use of the exclamation mark, which was intended to draw attention to some particularly nasty example of false positive, there was no explicit weighting system.
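The outcome scheme above can be recorded in a small data structure. The following is a minimal sketch, not the tooling used in the original evaluation: the names `ErrorOutcome`, `CorrectOutcome`, `SuiteExample` and `tally` are hypothetical, and the counts are deliberately unweighted, in line with the absence of an explicit weighting system.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ErrorOutcome(Enum):
    """Possible outcomes for an example that contains an error (asterisked)."""
    CLEAR = "response identifies error clearly"
    NO_RESPONSE = "system fails to respond to error"
    DIAGNOSABLE = "error could be diagnosed from the response"
    INDIRECT = "response too indirect for easy diagnosis"
    UNRELATED = "response completely unrelated to error"


class CorrectOutcome(Enum):
    """Possible outcomes for an example that contains no error."""
    IGNORED = "correctly ignores the correct text"
    FALSE_POSITIVE = "falls into false positive trap"
    AWFUL_FALSE_POSITIVE = "particularly awful false positive"


@dataclass(frozen=True)
class SuiteExample:
    error_type: str   # category from the user centred taxonomy
    text: str         # the test sentence
    has_error: bool   # True for an asterisked example


def tally(results):
    """Count outcomes per error type, without any weighting.

    `results` is an iterable of (SuiteExample, outcome) pairs. An outcome
    from the wrong set (e.g. CorrectOutcome for an asterisked example)
    raises ValueError.
    """
    counts = {}
    for example, outcome in results:
        expected = ErrorOutcome if example.has_error else CorrectOutcome
        if not isinstance(outcome, expected):
            raise ValueError(f"outcome {outcome} does not fit {example.text!r}")
        counts.setdefault(example.error_type, Counter())[outcome] += 1
    return counts
```

Keeping the two outcome sets as separate types makes it impossible to record, say, "correctly ignores the correct text" against an example that does contain an error.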
The system-based aspect of the taxonomy of errors that was developed for adequacy evaluation was the basis for the progress evaluation scheme for The Editor's Assistant. Error types were grouped into four categories, viewed from a computational linguistic point of view:
These error types are associated with general error detection/correction mechanisms in The Editor's Assistant. The third type (associated with syntactic awkwardness) was similar to pattern-recognition algorithms, since it merely recognised patterns in the analysed text. The other three types were based on different error hypothesis schemes and relied on an extra processing loop to test whether performing the suggested correction on the hypothesised error actually led to a better sentence. Work focused on developing implementations of these general mechanisms.
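The extra processing loop described above, in which a hypothesised correction is kept only if it leads to a better sentence, can be sketched as follows. This is an illustration of the general idea under stated assumptions, not the implementation used in The Editor's Assistant: `best_correction`, `doubled_word` and `toy_score` are hypothetical names, and the scoring function is a toy stand-in for whatever measure of sentence quality a real system would use.

```python
def best_correction(sentence, hypotheses, score):
    """Return the candidate correction that most improves the sentence.

    `hypotheses` is a list of functions, each yielding candidate
    corrections for one kind of hypothesised error. A candidate is kept
    only if `score` rates it strictly higher than the best seen so far,
    starting from the original sentence. Returns None if nothing improves.
    """
    best, best_score = None, score(sentence)
    for propose in hypotheses:
        for candidate in propose(sentence):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best


def doubled_word(sentence):
    """Toy error hypothesis: an immediately repeated word is an error."""
    words = sentence.split()
    for i in range(1, len(words)):
        if words[i].lower() == words[i - 1].lower():
            yield " ".join(words[:i] + words[i + 1:])


def toy_score(sentence):
    """Toy quality measure: fewer adjacent repeated words is better."""
    words = sentence.lower().split()
    return -sum(a == b for a, b in zip(words, words[1:]))
```

With these toy components, `best_correction("This is is a test.", [doubled_word], toy_score)` proposes deleting the repeated word and accepts the result because it scores higher than the original, while a sentence with no repeated word yields no accepted correction.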
In May 1993, PC Magazine published a review of writers' tools based on its own performance tests (Rahmstorf93). In this section we will briefly describe the methodology adopted by PC Magazine Laboratories in performing grammar checker tests.
The systems reviewed were Correct Grammar for Windows, version 2.0, Correct Grammar for DOS, version 4.0, Grammatik 5 for DOS and Windows, RightWriter, version 6, and CorrecText.
The test material comprised examples of typical punctuation and capitalisation errors and 75 typical grammar and style errors. Each error occurred in two different sentences to avoid the potential problem that a single example would be unusually easy or difficult for a particular program to detect. In many cases a grammar program could detect only one of the examples of a problem.
The evaluation of the output for each sentence involved classifying the program's response as one of four types and assigning an appropriate score, as shown in table D.1.1.2.
| Response | Score | Comments |
| Hit | 1 | The program flags an error. |
| Miss | 0 | The program does not detect an error. |
| Advice | 0-4 | The program gives one or more good suggestions about how to correct an error. Each sentence with this response was given a score from 0 to 4 according to the quality of suggestions given by the program (a score of 4 means that the error was flagged and all elements contributing to the problem were detected). |
| False flag | 0 | The program detects a non-existent error or recommends a correction that introduces an error. Negative points were given for flagging nonexistent errors or providing advice that introduced new errors. |
An additional score used is the Hits-to-False-Flags Ratio, that is, the number of hits relative to the number of false flags. The higher the ratio, the less likely it is that the program will report nonexistent errors or offer bad advice.
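The scoring scheme can be made concrete with a short sketch. Several points are assumptions rather than details from (Rahmstorf93): the four responses are treated as mutually exclusive per sentence, the ratio is computed from the count of Hit responses only, and a program with no false flags is given an infinite ratio. The table lists a score of 0 for false flags while also mentioning negative points; since the size of any penalty is not stated, the sketch keeps the tabulated 0 and merely counts false flags. The names `score_response` and `summarise` are hypothetical.

```python
import math
from collections import Counter

# Fixed scores taken from the table; "advice" is scored separately (0 to 4).
BASE_SCORES = {"hit": 1, "miss": 0, "false_flag": 0}


def score_response(response, advice_quality=None):
    """Return the score for one sentence's response."""
    if response == "advice":
        if advice_quality is None or not 0 <= advice_quality <= 4:
            raise ValueError("advice needs a quality score from 0 to 4")
        return advice_quality
    return BASE_SCORES[response]


def summarise(responses):
    """Return (total score, hits-to-false-flags ratio).

    `responses` is an iterable of (response, advice_quality) pairs, where
    advice_quality is None unless the response is "advice".
    """
    counts = Counter()
    total = 0
    for response, quality in responses:
        total += score_response(response, quality)
        counts[response] += 1
    false_flags = counts["false_flag"]
    # Assumption: no false flags at all is reported as an infinite ratio.
    ratio = counts["hit"] / false_flags if false_flags else math.inf
    return total, ratio
```

For example, two hits, one miss, one advice response rated 3 and one false flag give a total score of 5 and a ratio of 2.0 under these assumptions.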
Grammar checkers were also evaluated for their spell-checking abilities. For this purpose, the test material contained 100 misspelled words in ten different categories. One point was given for detecting a misspelled word and suggesting the correct alternative.
(Chandler89) reviewed a number of grammar checkers for Electric Word. Chandler, an English teacher at the University of Hawaii, was interested in assessing the usefulness of grammar checkers for his students, who included a large number of Asian second language students. The systems he tested were Grammatik III, Correct Grammar and Critique. He assessed the systems not only for their linguistic accuracy in detecting errors (grammatical and usage errors) but also for their user-friendliness. Under the second category he considered factors such as the helpfulness of the tutorial (one that presented examples was more helpful than one that just stated rules) and the type of editor. Chandler also considered it helpful if parse trees were presented to the user. Additional factors that he took into account were how easy it was to add words to a system's dictionary and the price of the system.
Chandler's assessment of the systems according to the dimensions outlined above appears to be entirely subjective, except in the assessment of error detection, where he used a very small test suite. He ran the test suite twice through each system. The phenomena tested were intended to contain typical errors made by Asian second language students. Errors included grammatical errors (including agreement, comma splices, and subordination and modifying phrases) and usage errors (e.g. a for an and childrens for children's).
The evaluations described in the above reviews differ vastly in scope and rigour of approach. Nevertheless, there is a fair amount of agreement on which characteristics are important when assessing the performance of a grammar checker. These performance characteristics or functionality characteristics include: