A taxonomy of grammar errors

The taxonomy of errors can be derived in two ways. It can be derived from a combination of the proofed text model and the errors likely for particular writer types. Alternatively, it can be derived from analysis of texts before and after the notional process of copy-editing, that is, from a comparison of the unproofed text model with the proofed text model. In both cases, we are interested in errors that actually occur. The purpose of the taxonomy is:

Given these purposes, we must establish a principle to use in classifying grammar errors. The two derivations of the taxonomy we have offered (proofed text plus writer error sources, or proofed text compared with unproofed text) come from different directions, and must be brought together to form a useful classification.

If we take the approach from the proofed text and the sources of writer errors, we could classify errors by their source in the writing process. A few obvious types in this classification, as mentioned in the writer model, would be: slips of medium (typing errors, OCR errors, cut and paste slips...); dialect differences between the writer's language and some standard language; second language errors; concentration lapses resulting in 'derailed' sentences; and other performance errors. Such a taxonomy has two advantages. If we have a proper writer model, it covers all errors that result in ungrammatical text. It may also fit the writer's and end-user's categories of thought, and thus permit easy mapping onto customer-reportable attributes, which is an important purpose of the taxonomy. However, our writer model would then have to be a detailed psycholinguistic model of language competence and performance, which is a tall order. In practice, the source of our writer model is likely to be an analysis of proofed and unproofed texts, that is, working back from the second type of derivation of the taxonomy of errors.

Comparing proofed and unproofed texts as a way of arriving at errors has the practical advantage that it rests on what actually appears in the texts, as far as it goes. We can relate this comparison to the traditional copy-editor's marks, which simply record editor operations such as moving, deleting, adding, transposing and so on. However, these marks carry little information about the linguistic causes or categories underlying errors, and thus give us no useful categories on which to base frequency analysis, reporting attributes, or test generation. Even for a single user, a given error example could have multiple possible causes in such a taxonomy; only the writer could tell, and perhaps not even the writer. In human copy-edited text, the semantic information is supplied by the reader (the end-user in our model). In creating writer models and error taxonomies from proofed and unproofed text pairs, the researcher must therefore attempt to produce a classification that is informed not only by actual text occurrences but also by some idea of source.
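The limitation of a purely operation-based comparison can be illustrated with a small sketch. It is not part of the method described here: it assumes word-level tokenisation by whitespace and uses Python's standard difflib module as a stand-in for the copy-editor's marks. The function name edit_operations and the sentence pair are hypothetical.

```python
from difflib import SequenceMatcher


def edit_operations(unproofed: str, proofed: str):
    """Return the word-level operations that turn unproofed into proofed.

    Each item is (operation, words_removed, words_added), where operation
    is one of difflib's tags: 'replace', 'delete' or 'insert'.
    Whitespace tokenisation is an assumption made for this illustration.
    """
    a, b = unproofed.split(), proofed.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only the spans the editor changed
            ops.append((tag, a[i1:i2], b[j1:j2]))
    return ops


# Hypothetical unproofed/proofed pair.
ops = edit_operations(
    "The results was reported in in the paper",
    "The results were reported in the paper",
)
for op in ops:
    print(op)
# ('replace', ['was'], ['were'])
# ('delete', ['in'], [])
```

The output records that one word was replaced and one deleted, but nothing in it distinguishes an agreement error (which might stem from dialect, second-language interference or a concentration lapse) from a repeated word (which might be a slip of medium). That information has to come from a source-based classification of the kind discussed above.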

Aspects of both approaches are combined in the design of our taxonomy. The taxonomy is arranged in terms of the writer sources of errors, which give the semantics that will support the end-user task. Each source is mapped onto a set of transformations of the model of proofed text, at the analytic level appropriate to the error type. Some examples:

