Many of the knowledge elicitation techniques which have become established in KBS development are relevant to our more specific task in NLP evaluation. This is particularly true when considering the presentation of material to end-users, when their natural categorisations of phenomena (rather than the categorisations of linguists or computer scientists) must be determined. It may also, however, be true for experts such as the linguists and computer scientists who build the kind of systems under test -- establishing their understanding of difficult areas where systems are weak, whether because of the techniques used or because of the phenomena involved. Such `inside' knowledge can support the development of `trick' tests which may serve to diagnose the kind of techniques being used in a system, which in turn may predict a whole aspect of its performance on user requirements.
Organised knowledge acquisition methods help to standardise the requirements they produce, in the sense that a specific method can make it easier to check or repeat the knowledge acquisition process. An extremely useful survey of knowledge acquisition methods and procedures, with suggestions about circumstances in which different methods are useful, is given in (Cordingley89).
Methods range from informal techniques such as `user observation' through common social science methods such as interviews, questionnaires, and discourse analysis (Ericson84) to more formal techniques used in KA for KBS. The latter are less well known in general, and are briefly introduced here.
The identification of relevant phenomena requires open-ended techniques, perhaps using some sort of scenario walkthrough technique with users. The first step is to identify suitable information providers and a core idea of the tasks of interest. `Natural' observation may provide information about typical situations and tasks, but it may also be relevant to deliberately elicit information from informants about various common types of situation, as well as rare but important situations, and walk through these to gain further information, a sort of scenario-based requirements elicitation process. At the early stages, it is crucial not to have a ready-made list of categories into which users' views of the domain must be fitted; this can distort the information and prevent the uncovering of relevant phenomena.
As relevant phenomena are identified, it is necessary to categorise them, although the two activities cannot be separated so cleanly. A number of relatively formal methods are in use (e.g., repertory grid analysis, laddering, card sorts -- see (Cordingley89) for details), largely based around the ideas of Personal Construct Theory (Kelly55), providing constrained sorting or organising operations to allow the development of categorisation schemes based on the identification of contrastive discriminations whose labels are the constructs. In card sorts, for instance, some small and manageable set of elements is chosen, and the subject sorts like with like in as many ways as she can think of; each such distinction can be associated with a labelled construct. For instance, a set of spelling checkers might be sorted on the basis of whether they allow sharing of user-defined dictionaries; this would then become a construct. In the use of this sort of technique for product design or evaluation, it is common to include in the list of items to be sorted some `ideal' product, to facilitate the production of constructs which do not correspond to any discriminations between existing systems, but which are guides to real user needs or wishes.
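The recording of a card sort can be sketched very simply: each sorting pass over the same elements yields a labelled partition, and each partition becomes a binary attribute over the elements. The element and construct names below are invented for illustration, as is the inclusion of an `ideal' checker among the items sorted.

```python
# Illustrative sketch of recording a card sort: each pass over the same
# elements yields a partition, and its label becomes a construct.
# All element and construct names here are hypothetical.

elements = ["CheckerA", "CheckerB", "CheckerC", "CheckerD", "IdealChecker"]

# Each sort: construct label -> sets of elements on each pole.
sorts = {
    "shares user dictionaries": {"yes": {"CheckerA", "CheckerC", "IdealChecker"},
                                 "no": {"CheckerB", "CheckerD"}},
    "suggests corrections": {"yes": {"CheckerA", "CheckerB", "IdealChecker"},
                             "no": {"CheckerC", "CheckerD"}},
}

def as_attributes(sorts, elements):
    """Turn each labelled sort into a binary attribute over the elements."""
    table = {}
    for construct, poles in sorts.items():
        positive = poles["yes"]
        table[construct] = {e: e in positive for e in elements}
    return table

attrs = as_attributes(sorts, elements)
```

Because the `ideal' element takes part in every sort, constructs that no real checker satisfies still emerge, which is the point of including it.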
Constructs, as used, for example, in repertory grids, look very much like reportable attributes; a typical repertory grid based on elements which are spelling checkers might be represented as a matrix with one axis listing different spelling checkers and the other various attributes of the checkers, so that the values in the grid characterise the checkers. The main purpose of the grid, however, is not to evaluate the checkers but to elicit constructs/attributes that validly represent the important discriminations to be made among checkers. The use of an automated software system for representing repertory grids allows the values to be used in statistical analysis to identify clusters of constructs, and to compare categorisations by a number of subjects. The labels used with personal constructs are a matter of personal usefulness, and so the process of moving from a set of personal constructs to a sharable set of attributes suitable for a wider audience may involve negotiation; there has been some work done on techniques for the identification of clashes in terminology and usage (Shaw89).
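The grid-as-matrix idea, and the kind of clustering such analysis performs, can be sketched as follows. The grid, its rating scale, and the city-block distance used here are all illustrative assumptions, a simple stand-in for the richer analyses that grid tools provide; the checker and construct names are invented.

```python
# Hypothetical repertory grid: rows are constructs, columns are spelling
# checkers, cells are ratings on a 1-5 scale elicited from one subject.

from itertools import combinations

grid = {
    "easy to install":        {"CheckerA": 5, "CheckerB": 2, "CheckerC": 4},
    "good suggestions":       {"CheckerA": 4, "CheckerB": 1, "CheckerC": 5},
    "handles technical text": {"CheckerA": 1, "CheckerB": 5, "CheckerC": 2},
}

def construct_distance(grid, c1, c2):
    """City-block distance between two constructs' rating rows -- a
    simple proxy for how similarly they discriminate among the checkers."""
    return sum(abs(grid[c1][e] - grid[c2][e]) for e in grid[c1])

# Constructs at small distances cluster together: they make much the same
# discriminations, suggesting they may name one underlying attribute.
closest = min(combinations(grid, 2),
              key=lambda pair: construct_distance(grid, *pair))
```

On this toy grid, "easy to install" and "good suggestions" cluster, hinting that the subject may be using two labels for one discrimination, exactly the kind of observation that prompts negotiation over shared attribute names.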
There are a number of software systems available which can automate elicitation and analysis using repertory grids, such as WebGrid (Gaines and Shaw, http://tiger.cpsc.ucalgary.ca/KAW); (Gaines93).
Corpus analysis remains the main distinctive method in NLP evaluation. Corpus analysis is a recognised part of KA, where protocols of interviews, expert think-alouds, etc., form the documents -- there are even attempts to extract domain models, etc., automatically from the transcripts of experts. In these cases, however, the documents are a secondary object of analysis; in NLP evaluation, they are often the primary one. In fact, while NLP suffers in comparison with some traditional SE domains because the input and output of its systems cannot be easily and concisely described, it has the advantage that extensional descriptions of required behaviour (instances of linguistic input and output) can be relatively readily obtained, and constitute objective evidence. The effort involved in analysing this evidence may be considerable, but the effort in collecting and making available suitably prepared corpora is greater. However, it is the only way to provide reliable information about the linguistic nature of the problem domain, as long as the documents are in fact representative of the problem domain.
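One elementary form such analysis takes is tallying how often phenomena of interest occur in a corpus, as a check on whether a test set reflects the problem domain. The minimal sketch below assumes sentences have already been annotated with the phenomena they exhibit; the sentences, phenomenon labels, and annotation scheme are all invented for illustration.

```python
# Minimal sketch: given a corpus whose sentences are annotated with the
# phenomena they exhibit, tally frequencies as crude evidence about the
# linguistic nature of the domain. Annotations here are hypothetical.

from collections import Counter

corpus = [
    ("The report was read by the board.", ["passive"]),
    ("Flying planes can be dangerous.", ["ambiguity"]),
    ("The board read the report.", []),
    ("It was the board that read it.", ["cleft", "anaphora"]),
]

counts = Counter(p for _, phenomena in corpus for p in phenomena)
coverage = {p: n / len(corpus) for p, n in counts.items()}
```

Even such a crude tally only tells us about the domain to the extent that the corpus is representative of it, which is the caveat noted above.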
Corpus analysis can be usefully linked to some of the techniques described above, since the presence of a corpus does not determine how to identify or categorise the phenomena represented there. Paying attention to who has categorised the phenomena in a corpus, and what specific methods if any were used, should be useful.