This appendix gives an indication of how some grammar checkers can be evaluated in the light of EAGLES work on writers' aids. The tests were not intended to be exhaustive, but to validate that the methodology chosen is on the right track. In addition to a grammar checker for English, a grammar checker for another language was also tested, to verify that the methodology is applicable regardless of the specific language in question.
Two grammar checkers were tested: one for English and one for French. Since our aim is not to evaluate a specific product but to test our own methodology, we refer to the grammar checkers without using their commercial names. The English grammar checker is referred to as E1 and the French one as F1.
For the first round of testing undertaken on E1 and F1, two types of text were used. One was extracted from a collection of economic bulletins written by Union Bank of Switzerland (UBS). These are intended for the general public, and although some economic terms inevitably appear, the vocabulary is not highly specialised. The texts were supplied in their revised, published form. An advantage of using them was that they exist in both English and French, which made it possible to compare the performance of the two grammar checkers on equivalent texts. We have reason to believe that the transfer of the texts to us introduced occasional typographical errors. This was the case for most hyphenated words, where the hyphen and the first letter of the second word disappeared, as in, for example, Etats-Unis vs. Etatsnis in the French texts and long-term vs. longerm in the English ones.
The other type of text consisted of a few series of sentences written especially to test the checkers. These can in turn be subdivided into three groups: sentences that contain a grammatical error, sentences that contain no errors, and sentences that contain no errors but feature a structure meant to trap the software.
For the second round of testing only constructed material was used.
The overall impression is that both checkers do catch various types of error, but that overflagging (that is, indicating as an error a sentence or a part of speech that is in fact correct) is fairly high, which greatly reduces the usefulness of the systems.
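The notion of overflagging can be made concrete with a small calculation. The following is a minimal sketch, not the procedure used in these tests: it assumes that each test sentence is recorded together with whether it really contains an error and whether the checker flagged it, and it counts at sentence level only. The names `TestSentence`, `overflagging_rate`, `detection_rate` and the sample sentences are hypothetical and do not reproduce any result obtained with E1 or F1.

```python
from dataclasses import dataclass


@dataclass
class TestSentence:
    text: str
    has_error: bool  # True if the sentence really contains a grammatical error
    flagged: bool    # True if the checker reported an error (invented outcome)


def overflagging_rate(results):
    """Share of error-free sentences that the checker flagged anyway."""
    correct = [r for r in results if not r.has_error]
    if not correct:
        return 0.0
    return sum(r.flagged for r in correct) / len(correct)


def detection_rate(results):
    """Share of erroneous sentences that the checker flagged."""
    erroneous = [r for r in results if r.has_error]
    if not erroneous:
        return 0.0
    return sum(r.flagged for r in erroneous) / len(erroneous)


# Invented sample covering the three kinds of constructed sentence:
# erroneous, correct, and correct but built to trap the software.
SAMPLE = [
    TestSentence("He go to school.", has_error=True, flagged=True),
    TestSentence("She don't like it.", has_error=True, flagged=False),
    TestSentence("The report was published.", has_error=False, flagged=False),
    TestSentence("The old man the boats.", has_error=False, flagged=True),
]

if __name__ == "__main__":
    print("overflagging rate:", overflagging_rate(SAMPLE))
    print("detection rate:", detection_rate(SAMPLE))
```

Keeping the two rates separate reflects the observation above: a checker may catch many errors and still be of limited use if it flags a large share of correct sentences.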
As the testing of E1 and F1 was conducted on two types of text, the results are described separately for each type of text.