This appendix presents a selection of methods for measuring software. The methods can be applied to any software system, but the selection was compiled with a view to the special problems that software evaluation poses in the NLP area. As has been repeatedly pointed out, each evaluation scenario calls for a particular combination of test types and instruments, just as each environment calls for different metrics to be measured. This report should therefore be understood as a practical aid and guideline for the testing of NLP software: it discusses the available possibilities, together with their merits and drawbacks, fairly exhaustively.
It is suggested that working groups within EAGLES evaluation assess the different test types and instruments and test the practical guidelines delivered in this appendix. Although the appendix is based on practical evaluation experience with NLP applications, further practical testing exercises will certainly lead to useful discussion and to further development of the testing framework presented here.
A long-term objective of the EAGLES evaluation group is the standardisation of evaluation procedures. Here, test types and instruments are among the prime candidates for standardisation. It is hoped that this report will contribute to this effort.