In this section, we briefly survey some of the materials frequently used in testing, mentioning some of the advantages and disadvantages associated with each.
Test sets, in the sense used here, are collections of naturally occurring text in electronic form. Some of the best known available are the Brown Corpus of English (BNC95), the Trésor de la Langue Française corpus of French (TLF95), and the bilingual (English-French) material drawn from the Canadian Hansard (Canada95), the latter forming a test set of parallel texts. With increased computing capacity to process and store large amounts of text, a strong interest has recently developed in collecting test sets and in defining tools able to make use of them (see, for example, (Liberman89)). In line with this, the Association for Computational Linguistics has launched a Data Collection Initiative (DCI), and the Linguistic Data Consortium is specifically concerned with collecting test sets.
The LRE project MLCC is engaged in collecting plurilingual and parallel multilingual material for the European languages.
These efforts are all concerned with collecting what might be called general material: that is, they do not aim to reflect some particular pattern of needs or some specific set of uses to which the material may be put, but rely on what text can be found in machine readable form in large quantities. Although this does not detract from the value of such collections of linguistic data, it does raise questions about their representativeness. For example, the Hansard material is clearly representative of the English and French used in the Canadian Parliament, but is highly unlikely to be representative of the French used in technical documentation or the English used in school text books. Why this is a problem is most clear in the case of evaluations designed to test a system's ability to deal with some particular type or types of text; it will be pure serendipity if a general test set contains texts of the relevant types.
On the other hand, there is a strong sense in which every test set is representative of something, and this in its turn can lead to misleading results if the test set is mistakenly taken to be representative of, say, English or French in general.
The use of general test sets must then be approached with some caution (see http://www.ilc.pi.cnr.it/EAGLES96/corpintr/corpintr.html).
By test suite here, we mean sets of inputs, artificially constructed and designed to probe the system's behaviour with respect to some particular phenomenon. The main problem associated with test suites is the complexity of their construction. Even at the level of syntactic phenomena, there are problems in defining inputs which will test precisely what one wants to test, and once semantic, pragmatic or translation phenomena are taken into consideration, test suite construction becomes a very delicate matter indeed.
Furthermore, test suites can quickly become unmanageably large. A principle usually adopted is to design one input per linguistic phenomenon to be tested, in order to isolate the system's behaviour with respect to that phenomenon. However, real text rarely contains one interesting linguistic phenomenon per sentence, and much of the real interest in a system's behaviour is in looking precisely at what happens when the input contains interacting phenomena. A test suite based on constructing one input per phenomenon is already large if any serious attempt is made to cover a language exhaustively. Once interactions are to be accounted for, the problem of size becomes critical. Notice, too, that size is a problem not only in constructing the test suite but in administering it and analysing the results. These and some of the problems specifically associated with constructing test suites for evaluation of machine translation systems are discussed in (King90).
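The growth in suite size once interactions are covered can be illustrated with a small sketch. The phenomenon names below are purely illustrative, not drawn from any actual test suite; the point is only that moving from one input per phenomenon to one input per pair of phenomena makes the suite grow quadratically, and higher-order interactions grow faster still.

```python
# Illustrative only: how a one-input-per-phenomenon test suite grows
# once pairwise interactions between phenomena must also be covered.
from itertools import combinations

phenomena = [
    "agreement", "passive", "relative_clause", "coordination",
    "negation", "wh_question", "ellipsis", "clefting",
]

# One test input per isolated phenomenon.
isolated = len(phenomena)

# One additional input per pair of interacting phenomena: n*(n-1)/2.
pairwise = len(list(combinations(phenomena, 2)))

print(isolated)  # 8
print(pairwise)  # 28
```

Even with only eight phenomena, covering pairwise interactions more than quadruples the suite; a serious attempt at exhaustive coverage of a language multiplies this many times over.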
All these considerations constitute a strong argument for thinking of general test suites, of the sort envisaged by the Hewlett Packard group, as good candidates for collaborative development. A note of caution is in order, however, if collaborative development is taken to mean simply pooling such test suites as exist. Test suites are often designed in the context of testing a specific system. There is a danger in that case that, deliberately or inadvertently, they are attuned to that particular system, thus limiting their applicability when other systems are to be evaluated.
The LRE project TSNLP is engaged in drawing up and exemplifying guidelines for the construction of test suites.
By a test collection we mean a set of inputs associated with a corresponding set of expected outputs. In information or document retrieval, for example, a test collection consists of a set of documents, a set of queries or topics and a set of relevance judgements, which identify the individual documents relevant to the individual topics or queries. Typically, these elements of the test collection are divided into training sets and test sets.
The most costly element in creating such a test collection is the creation of the relevance judgements. When a large set of documents is concerned, the effort involved is so great that means are employed to allow the evaluation to be conducted with an incomplete set. Substantial effort is also required to develop the queries and topics: for the TREC evaluations mentioned in the section on recent history, they were designed by information analysts to present varying degrees of difficulty and to cover a wide range of subject matter.
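The structure described above, and the way relevance judgements drive scoring, can be sketched as follows. The format is a simplification for illustration (not TREC's actual file formats): relevance judgements are modelled as a mapping from query identifiers to sets of relevant document identifiers, and a system's ranked output for one query is scored by simple precision and recall.

```python
# Simplified sketch of a retrieval test collection and its use in scoring.
# Query and document identifiers are invented for illustration.

relevance = {  # query id -> set of document ids judged relevant
    "q1": {"d1", "d4", "d7"},
    "q2": {"d2"},
}

def precision_recall(query_id, retrieved):
    """Score one system's retrieved list against the judgements."""
    relevant = relevance[query_id]
    hits = len(relevant & set(retrieved))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The system retrieved four documents for q1; two of them are relevant.
p, r = precision_recall("q1", ["d1", "d3", "d4", "d9"])
print(p, r)  # 0.5 0.6666666666666666
```

The cost problem is visible in this sketch: the `relevance` mapping is the expensive part, since in principle every document must be judged against every query. This is why, for large document sets, evaluations are conducted with incomplete judgements, typically restricted to the documents actually retrieved by participating systems.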
In the MUC-3 evaluation, the filling of the templates for the terrorism test collection required not only making relevance judgements but also determining how many relevant terrorist incidents were reported in a given document (and, therefore, how many templates to generate) and which passages contained explicit or implied information pertinent to each of these incidents. This task was carried out in a co-operative venture involving the evaluators and the conference participants. Virtually every text presented difficulties of interpretation due to vagueness, ambiguity or outright self-contradiction, or due to inadequacies of the template representation or task documentation. Therefore, as the template filling task was proceeding, the documentation containing the task and output specifications had to be refined. (For a lively discussion of this effort, see (Lehnert91).)
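The template-filling task just described can be made concrete with a rough sketch. The slot names and document text below are invented for illustration and do not reproduce the actual MUC-3 template definition; the point is that a single document may report several incidents, each yielding its own template, with slots left unfilled where the text is vague.

```python
# Loose sketch of MUC-style template filling; slot names are illustrative.

document = ("A bomb exploded in Lima on Tuesday ... Separately, "
            "gunmen attacked a power station near Cuzco ...")

# Two reported incidents in one document -> two answer-key templates.
templates = [
    {"incident_type": "BOMBING", "location": "Lima",
     "target": None},  # slot left empty: the text does not say
    {"incident_type": "ATTACK", "location": "near Cuzco",
     "target": "power station"},
]

# Scoring compares system-filled templates against these keys slot by slot.
print(len(templates))  # 2
```

Deciding how many templates a document warrants, and which slot values the text actually supports, is exactly where the difficulties of vagueness, ambiguity and self-contradiction described above arise.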
The value of such a test collection in evaluating systems comparatively to discover their adequacy with respect to the task defined by the test collection cannot be denied. The effort and cost of constructing the test collection naturally raise the question of how it can be re-used elsewhere. To take the MUC case, first, such a collection provides an obvious and valuable source of test materials to research groups working in the domain, independently of participation in the MUC conferences. A new set of issues arises if the collection is to be used for progress evaluation. In order to obtain direct comparability of results, the collection and the evaluation metrics should remain absolutely stable. Yet freezing the collection prevents it from being as good as it might be. For example, in the MUC case again, for information retrieval, additional relevance judgements will improve it, and for information extraction, higher quality filled templates will improve it.
Although a test collection can be re-used for subsequent formal evaluations (usually with a changed test set), a change must eventually be made, with all the time and effort that implies. This is especially true for the MUC collections, which have concentrated on just one domain. If no change is made, the collection may at some point start to hinder progress in the field, as it tends to encourage researchers to focus on certain key problems and to ignore others, and, once a system reaches a certain level of maturity, it may encourage researchers to spend more time tweaking the system than tackling the remaining major issues. However, it is certainly true that the TREC and MUC collections are sufficiently large, challenging and well-defined to support research and development in information retrieval and information extraction for a long time, even after they are no longer being used for formal evaluation.
However, the high cost of designing and constructing test collections makes it hard to imagine their being constructed outside the evaluation-guided research paradigm, where the investment implied by a number of different groups working over a considerable period on essentially the same task can be used to justify the expense involved.
With all the test materials we have discussed, there is a tension between constructing test material which is in some sense general and can be shared across different evaluations, and the common-sense feeling that most evaluations are specific, at least to the degree that the system has been constructed to carry out some specific task and should be evaluated on its ability to do so. This is as true for evaluation methodologies as it is for test materials. Designing and carrying out an evaluation is costly both in time and in money; it would be helpful to everyone concerned if ways could be found to share methodologies. But, at the same time, each evaluation is very specific to a particular system and, perhaps even more importantly, to the specific environment in which the system should work. Increased experience and widespread discussion of evaluation techniques as a topic worthy of consideration in its own right should lead to a better awareness of what can be shared and what cannot.