Reviews

This section describes three different reviews of grammar checkers. The three reviews serve as a starting point for the development of the quality characteristics specific to grammar checkers that are presented later in the section Quality characteristics.

Notes from ``Intelligent text processing: A survey of the available products'' by Shona Douglas

This report (Douglas90) describes the capabilities of some of the most common automated writing aids, primarily from a linguistic perspective, but also dealing with user-friendliness in terms of the user interface and the degree of customisability they offer. The evaluation was carried out in 1990 on a set of six then commercially available grammar and style checking programs (Correct Grammar, Grammatik, MacProof, RightWriter, Sensible Grammar, and StyleWriter). The evaluation had two aims: firstly, to assess the adequacy of the systems in order to ascertain the commercial state of the art, and secondly, to assess the kinds of technology used and how these related to the adequacy displayed. The second aim was intended to provide input to the design and development of the automated writing aid The Editor's Assistant at the University of Edinburgh.

Summary of method

The twofold aims of the investigation are mirrored in two linked taxonomies. The first, the user centred taxonomy, provides a list of error types, intended as a system of categories by which the errors detected by systems can be classified so as to correspond to the user's conceptual model of the domain of activity. The second, the system centred taxonomy, aims to characterise the possible underlying mechanisms used by different systems along a number of dimensions. The two taxonomies are linked by the fact that detecting and correcting particular errors (or examples of errors) was deemed to require a certain level of linguistic technology. The aim of the evaluation was to use the user centred taxonomy to position the system under test within the system centred taxonomy, i.e., to determine, from the results of a set of tests based on the user centred taxonomy, what level of technology the system under test utilised, and thus its potential range of operation. This is a kind of general diagnostic evaluation, not unrelated to the idea of reverse engineering.

The system centred taxonomy

The dimensions of the system centred taxonomy were:

  1. Representing the text;
  2. Recognising patterns in the text;
  3. Responding to errorful text;
  4. Manipulating the text representation; and
  5. Customising the rules.
The first three dimensions are self-explanatory. Dimension (4) relates primarily to interface characteristics: how the error reports are delivered and how easy editing is. Dimension (5) refers to how much freedom the user has to change the operation of the system, for example by switching rules on and off or by writing new ones. For dimensions (4) and (5), the systems were classified more or less by inspection or by consulting the user documentation.

The numeric results assigned to dimensions (1) and (2) were intended to reflect the overall potential capability of the system's underlying technology to detect and respond to grammatical errors. The division into how the text is apparently represented and what kind of patterns the error rules can find in the text was intended to clarify the level of linguistic sophistication of the underlying methods used by the various systems. These dimensions, and to some extent the third dimension (which reflects whether responses from the system seem to be all canned text or whether they have variable pattern elements), form a group of properly linguistic objects of evaluation.

The user centred taxonomy: A high-level specification of functionality

The user centred taxonomy is divided into four top-level sections: grammar, punctuation, style and usage, of which grammar is considered in most detail here. Grammar errors are subdivided into agreement and inflection errors, errors in the use of relative and reflexive pronouns, errors in comparatives and correlatives, problems with negation, and numerous heterogeneous categories. Punctuation errors are considered according to the following classification of punctuation functions: parenthetic marks, sentence terminators, sentence punctuators, spacing and hyphenation. Problems concerning how these marks can appear in combination are also covered.

Concerning style, the aim is to write clearly, directly and in easily comprehensible English. Among the style errors, the use of the passive voice, nominalisations and the use of over-long or complex sentences are mentioned.

Typical usage errors are capitalisation and doubled-word errors. Other errors are due to confusions arising from, for example, phonetic similarities. Still others concern the appropriate use of language in terms of genre or dialect, such as British English vs. American English and formal vs. informal style. In addition, since not all documents have the same requirements, house style is mentioned as a way of imposing consistency over a wide range of characteristics of a text.

Associated with each error type are examples and requirements. The latter are intended to reflect what underlying linguistic sophistication a system would need to successfully detect the error in question and hence they provide a link from the user centred taxonomy to the system centred taxonomy.

The test suite

The values assigned on the dimensions of assessment derive from the results of applying a test suite based on the user centred taxonomy. The test suite was composed of an extension of the examples in the user centred taxonomy. Each error type has its corresponding set of examples in the test suite. Test examples are constructed in both positive and negative modes, that is, where an error exists in the construction and where one does not.
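
To make the organisation of such a suite concrete, the following Python sketch shows one plausible way of recording test items. The field names and the requirement labels are illustrative assumptions, not Douglas's terminology, and the example sentences are borrowed from elsewhere in this section rather than from the actual suite.

  # A minimal sketch of a test suite organised along the lines described above.
  # Field names and requirement labels are illustrative assumptions; the example
  # sentences are borrowed from elsewhere in this section.
  from dataclasses import dataclass

  @dataclass
  class TestItem:
      error_type: str       # category from the user centred taxonomy
      requirements: str     # linguistic sophistication needed to detect the error
      text: str             # the test sentence itself
      contains_error: bool  # True for "asterisked" examples that deliberately contain the error

  TEST_SUITE = [
      TestItem("agreement", "feature-based constraint checking",
               "There is violence in the school.", False),
      TestItem("agreement", "feature-based constraint checking",
               "There are violence in the school.", True),
      TestItem("lexical confusion (its/it's)", "local part-of-speech context",
               "The dog wagged its tail.", False),
      TestItem("lexical confusion (its/it's)", "local part-of-speech context",
               "The dog wagged it's tail.", True),
  ]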

Scoring the tests

The set of possible outcomes for each example reflected the dual aim of testing for positive and negative success. For an example marked with an asterisk, denoting the fact that it contains an error, there are five possible outcomes, described in table D.1.

Outcome   Explanation
√         The response identifies the error clearly.
0         The system fails to respond to the error.
½         The error could be diagnosed from the response, especially if it is an error of execution (i.e., the user only needs his attention drawn to it to recognise it).
?         There is a response, but it is sufficiently indirect to make error diagnosis difficult.
!         The response is completely unrelated to the error (i.e., a false positive showing up in a part of the suite not specifically designed to trap it).

Table D.1: Outcomes for examples marked with an asterisk

For an example not marked with an asterisk, the outcomes shown in table D.2 are possible.

Outcome   Explanation
√         The system correctly ignores the correct text.
          The system falls into a false positive trap.
!         A particularly awful false positive.

Table D.2: Outcomes for examples not marked with an asterisk

Apart from the use of the exclamation mark, which was intended to draw attention to some particularly nasty example of false positive, there was no explicit weighting system.
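
As an illustration of how raw results on such a suite might be summarised, the following sketch tallies the outcome codes of tables D.1 and D.2 per system. The grouping into positive success, negative success and false positives is an assumption made here for illustration; Douglas's report does not prescribe a summary statistic beyond the outcomes themselves.

  # Illustrative tally of outcome codes per system; the groupings below are an
  # assumption for this sketch, not part of Douglas's published procedure.
  from collections import Counter

  def summarise(outcomes):
      """outcomes: list of outcome codes, one per test example, e.g.
      'clear_hit', 'miss', 'half_diagnosable', 'indirect', 'unrelated' for
      asterisked examples, and 'correct_ignore', 'false_positive',
      'awful_false_positive' for unasterisked ones."""
      tally = Counter(outcomes)
      return {
          "positive success": tally["clear_hit"] + tally["half_diagnosable"],
          "negative success": tally["correct_ignore"],
          "false positives": tally["false_positive"] + tally["awful_false_positive"],
          "misses": tally["miss"],
      }

  print(summarise(["clear_hit", "miss", "correct_ignore", "false_positive"]))
  # {'positive success': 1, 'negative success': 1, 'false positives': 1, 'misses': 1}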

Progress evaluation for The Editor's Assistant

The system-based aspect of the taxonomy of errors that was developed for adequacy evaluation was the basis for the progress evaluation scheme for The Editor's Assistant. Error types were grouped into four categories, viewed from a computational linguistic point of view:

  1. Constraint violation errors:
    These involve what, in most contemporary syntactic theories, are best viewed as the violation of constraints on feature values. All errors in agreement fall into this category, for example There are violence in the school for There is violence in the school.

  2. Lexical confusion:
    These involve the confusion of one lexical item with another. Specifically included in this category are cases where a word containing an apostrophe is confused with a similar word without one, or vice versa. In practice, cases are limited to those where the confusion results in an ungrammatical sentence, that is, where the confused words are of different syntactic classes. For example: confusion of its and it's; confusion of there, their and they're; confusion of possessive 's and plural s.
  3. Syntactic awkwardness:
    Included in this category are cases where the problem is either stylistic or likely to cause processing problems for the reader. These errors are not syntactically incorrect, but are constructions which, if overused, may result in poor writing, and as such are often included in style-checker hit-lists. Thus, multiple embedding constructions, potentially ambiguous syntactic structures and garden path sentences are included in this category. A typical example of a garden path sentence is The horse raced by the barn fell. These problems are detectable by simple counting or recognition of syntactic forms.
  4. Missing or extra elements:
    These are cases where elements (either words or punctuation symbols) are omitted or mistakenly included in a text. For example: unpaired delimiters; missing delimiters; missing list separators; double syntactic function, etc.

These error types are associated with general error detection/correction mechanisms in The Editor's Assistant. The mechanism for the third type (syntactic awkwardness) amounted to pattern recognition, since it merely recognised patterns in the analysed text. The mechanisms for the other three types were based on different error hypothesis schemes and relied on an extra processing loop to test whether performing the suggested correction on the hypothesised error actually led to a better sentence. Work focused on developing implementations of these general mechanisms.
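
A minimal, runnable sketch of such a hypothesise-and-retest loop is given below. The confusion table and the parse-quality heuristic are toy stand-ins introduced here for illustration; the report does not describe the actual parser or scoring used in The Editor's Assistant.

  # Toy sketch of the hypothesise-and-retest loop described above. The confusion
  # table and the quality heuristic are placeholders for illustration only.
  CONFUSIONS = {"it's": "its", "its": "it's", "their": "there", "there": "their"}

  def hypothesise_corrections(words):
      """Yield (position, replacement) pairs for suspected lexical confusions."""
      for i, word in enumerate(words):
          if word.lower() in CONFUSIONS:
              yield i, CONFUSIONS[word.lower()]

  def parse_quality(words):
      """Toy stand-in for a real parser's quality score: prefer 'its' before a
      word that does not look like a verb and 'it's' before one that does."""
      score = 0
      for i in range(len(words) - 1):
          word, nxt = words[i].lower(), words[i + 1]
          if word == "its" and not nxt.endswith("ing"):
              score += 1
          if word == "it's" and nxt.endswith("ing"):
              score += 1
      return score

  def check_sentence(sentence):
      """Accept a hypothesised correction only if the corrected sentence scores
      better than the original (the extra processing loop described above)."""
      words = sentence.split()
      baseline = parse_quality(words)
      accepted = []
      for i, replacement in hypothesise_corrections(words):
          candidate = words[:i] + [replacement] + words[i + 1:]
          if parse_quality(candidate) > baseline:
              accepted.append(" ".join(candidate))
      return accepted

  print(check_sentence("The dog wagged it's tail"))
  # -> ['The dog wagged its tail']; the same loop rejects any change to the
  #    already-correct sentence "The dog wagged its tail".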

PC Magazine Laboratories' performance tests

In May 1993, PC Magazine published a review of writers' tools based on their own performance tests (Rahmstorf93). In this section we will briefly describe the methodology adopted by PC Magazine Laboratories in performing grammar checker tests.

The systems reviewed were Correct Grammar for Windows, version 2.0, Correct Grammar for DOS, version 4.0, Grammatik 5 for DOS and Windows, RightWriter, version 6, and CorrecText.

The test material comprised examples of typical punctuation and capitalisation errors and 75 typical grammar and style errors. Each error occurred in two different sentences, to avoid the potential problem that a single example would be unusually easy or difficult for a particular program to detect. In many cases a grammar program could detect only one of the two examples of a problem.

The evaluation of the output for each sentence involved giving one of four possible responses and assigning an appropriate score, as shown in table D.3.

Response     Score   Comments
Hit          1       The program flags an error.
Miss         0       The program does not detect an error.
Advice       0-4     The program gives one or more good suggestions about how to correct an error. Each sentence with this response was given a score from 0 to 4 according to the quality of the suggestions given by the program (a score of 4 means that the error was flagged and all elements contributing to the problem were detected).
False flag   <0      The program detects a non-existent error or recommends a correction that introduces an error. Negative points were given for flagging non-existent errors or for providing advice that introduced new errors.

Table D.3: PC Magazine Laboratories -- responses and scores

An additional score used is the Hits-to-False-Flags ratio, i.e., the number of hits relative to the number of false alarms. The higher the ratio, the less likely the program is to report nonexistent errors or offer bad advice.
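
To make the scoring scheme and the Hits-to-False-Flags ratio concrete, here is a small sketch. The shape of the response records, and the assumption that scored advice also counts as a hit, are choices made for this illustration and are not specified in the PC Magazine article.

  # A sketch of the scoring scheme summarised in table D.3. The record format
  # and the treatment of advice as a flagged error are assumptions made here.
  def score_responses(responses):
      """responses: list of dicts like
         {"kind": "hit" | "miss" | "advice" | "false_flag", "points": int}"""
      total = 0
      hits = 0
      false_flags = 0
      for r in responses:
          if r["kind"] == "hit":
              total += 1
              hits += 1
          elif r["kind"] == "advice":
              total += r["points"]   # 0-4 depending on the quality of the suggestions
              hits += 1              # the error was flagged (assumption)
          elif r["kind"] == "false_flag":
              total += r["points"]   # negative points
              false_flags += 1
          # "miss" contributes 0 points
      ratio = hits / false_flags if false_flags else float("inf")
      return {"total": total, "hits_to_false_flags": ratio}

  example = [
      {"kind": "hit", "points": 1},
      {"kind": "advice", "points": 3},
      {"kind": "miss", "points": 0},
      {"kind": "false_flag", "points": -1},
  ]
  print(score_responses(example))  # {'total': 3, 'hits_to_false_flags': 2.0}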

Grammar checkers were also evaluated for their spell-checking abilities. For this purpose, the test material contained 100 misspelled words from ten different categories. One point was given for detecting a misspelled word and suggesting the correct alternative.

Chandler's review of grammar checkers

(Chandler89) reviewed a number of grammar checkers for Electric Word. Chandler, an English teacher at the University of Hawaii, was interested in assessing the usefulness of grammar checkers for his students, who included a large number of Asian second language students. The systems he tested were Grammatik III, Correct Grammar and Critique. He assessed the systems not only for their linguistic accuracy in detecting errors (grammatical and usage errors) but also for their user-friendliness. Under the second category he considered factors such as the helpfulness of the tutorial (one that presented examples was more helpful than one that just stated rules) and the type of editor. Chandler also considered it helpful if parse trees were presented to the user. Additional factors that he took into account were how easy it was to add words to a system's dictionary and the price of the system.

Chandler's assessments of the systems along the dimensions outlined above appear to be entirely subjective, except for error detection, where he used a very small test suite, which he ran twice through each system. The phenomena tested were intended to cover typical errors made by Asian second language students, including grammatical errors (agreement, comma splices, and problems with subordination and modifying phrases) and usage errors (e.g. a for an and childrens for children's).

Conclusions

The evaluations described in the above reviews differ vastly in scope and rigour of approach. Nevertheless, there is a fair amount of agreement on which characteristics are important when assessing the performance of a grammar checker. These performance or functionality characteristics include:

  1. detection of genuine grammar, style and usage errors (hits);
  2. avoidance of false flags, i.e., responses to text that contains no error; and
  3. the quality of the advice or correction suggestions offered.

In addition to these performance characteristics, a number of other characteristics were reported on that are considered important in assessing grammar checkers, including:

  1. user-friendliness of the interface and ease of editing;
  2. the degree of customisability offered to the user, such as switching rules on and off or adding entries to the dictionary;
  3. the helpfulness of tutorials and documentation;
  4. spell-checking ability; and
  5. the price of the system.