Successful testing and evaluation is not only a matter of the choice of metrics and the optimal combination of test type and instrument for the particular test environment, but also includes detailed, correct and adequate reporting of the test results. Among the reporting instruments that are of relevance for user-oriented testing are test descriptions, test problem reports, result reports and assessment reports.
Test descriptions cover all factors that influence the overall evaluation procedure, including all details that are necessary to judge the performance of the test and to verify the interpretation of results. There are four major factors that determine the type of evaluation and the corresponding testing exercise, i.e. first and foremost, the motivation behind evaluation (Galliers93, 186), the system and its parameters, the evaluation environment, and, finally, the quality requirements that need to be tested.
Figure B.2: Factors Determining Evaluation Procedure
Galliers and Sparck-Jones rightly stress the importance of determining the detailed motivation behind evaluation (Galliers93, 141,186). Particularly the three principal aspects that denote the motivation behind evaluation need careful consideration, i.e. perspective, interest and consumer of the evaluation:
the perspective of evaluation denotes whether one is interested in the tasks which a system takes over (task-oriented), the amount of money that can be saved when implementing the system (financial), how the system can be integrated into an existing working environment (administrative), etc. (Galliers93, 186).
the interest taken in the evaluation (possible interests reported by (Galliers93, 186) are developer/funder ...) denotes the view taken on the evaluation exercise, i.e. even for the same type of evaluation, a developer may have a totally different view on what needs to be considered than the funder of a project, who will in turn set different foci than user-organisations, etc.
the consumer of the evaluation report (possible consumers reported by (Galliers93, 186) are manager/user/researcher ...) denotes whether managerial, scientific, practical or implementation-related aspects are in focus during evaluation and reporting (Galliers93, 141).
The system under testing is another factor that needs to be closely considered in an evaluation exercise. The settings of system parameters such as hardware platform, software modules and the state of the system, i.e. whether the system is a prototype, β-version or product, are givens and strongly determine the results of the evaluation exercise. For evaluation purposes it is important to note the difference between external system parameters, which normally remain constant during an evaluation exercise, and internal system variables, i.e. ``properties of or constraints on the system's inputs and outputs'' (Galliers93, 24), the values of which may be changed for evaluation purposes. In the case of a translation memory, for instance, the value for the target fuzzy match percentage may be changed from 90 to 60, which will inevitably lead to more translation proposals. However, whereas changing internal variables will only lead to different evaluation results for the same metric, changing the parameters may also lead to the choice of different metrics. If, for instance, a system prototype is evaluated, metrics measuring efficiency are not likely to be applicable, while for product evaluation, efficiency certainly is one of the most important quality characteristics.
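The translation memory example can be sketched in code. The following is a minimal illustration, not the interface of any actual product: the fuzzy-match threshold acts as an internal variable, and lowering it from 90 to 60 admits additional, less similar proposals while the metric itself stays the same.

```python
from difflib import SequenceMatcher

def fuzzy_matches(segment, memory, threshold):
    """Return all stored (source, target) pairs whose similarity to
    `segment` meets the given percentage threshold, best match first."""
    results = []
    for source, target in memory:
        score = SequenceMatcher(None, segment, source).ratio() * 100
        if score >= threshold:
            results.append((score, source, target))
    return sorted(results, reverse=True)

# Hypothetical memory contents, invented for illustration.
memory = [
    ("The printer is out of paper.", "Der Drucker hat kein Papier."),
    ("The printer is out of toner.", "Der Drucker hat keinen Toner."),
    ("Close all open windows.", "Schliessen Sie alle offenen Fenster."),
]
query = "The printer is out of paper."
strict = fuzzy_matches(query, memory, threshold=90)
loose = fuzzy_matches(query, memory, threshold=60)
assert len(loose) >= len(strict)  # lowering the threshold cannot reduce proposals
```

Changing the threshold only changes the values obtained for the same match metric; switching, say, from a prototype to a product would instead change which metrics are applicable at all.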
The evaluation environment is generally determined by the test personnel, the size of the budget and the amount of time invested. Test personnel can be divided into two major groups: (i) experts who function as evaluators during an evaluation exercise and (ii) users who function as subjects of tests (Falkedahl91, 20). On the evaluator side, it is important to note the number of evaluators participating in tests, their function, i.e. whether responsible or assistant, their educational background, and, last but not least, their experience with the test types and instruments used. User-oriented testing should preferably be performed by someone who was not involved in the development of the software (Thaller94, 61,90) and (Hausen82, 119). A certain knowledge of software engineering principles, and of both the practical and the computational side of the application under testing, is, however, very useful, if not indispensable.
On the subject side it is particularly important to note how many and what type of users participated in tests. Depending on the quality characteristics and the metrics applied, as well as on the test type and instruments, different types of users - students, professional users - can, or rather need to, be used as subjects (Falkedahl91, 21). The crucial qualifications required of test persons are (i) objectivity, (ii) representativeness, and, for all NLP applications, (iii) command of the languages required (Falkedahl91, 21). Thus for each individual user the above three qualifications have to be checked in the test planning phase and the background noted in the test descriptions. The quest for objectivity is particularly difficult in the NLP area, where there are rather few people who have some idea of the functionality of the systems but at the same time do not have previous experience with one system or another; those with such experience might judge the system under testing with respect to its similarity to the systems they know. ``It must therefore be noted that striving for objectivity runs the risk of making evaluations and tests using purportedly unbiased evaluators and test persons into purely artificial events whose results are likely to be insufficiently informative about or representative of how a system would be judged by its end-users, once they had become accustomed to its peculiarities.'' (Falkedahl91, 22).
The evaluation budget is naturally the most decisive factor when it comes to selecting evaluators, subjects, test types and instruments, as well as when determining the time that can be invested in evaluation. In professional evaluation environments, planned and actual evaluation costs need to be calculated and noted on the test description. It is important to note here that, in case of a limited evaluation budget, it is advisable to reduce the number of metrics that will be tested and to select less expensive instruments, rather than to reduce the number and qualification of test personnel. While a limited number of metrics only reduces the scope of the evaluation exercise, savings in the area of test personnel lead to less reliable test results.
The notion of software quality cannot be defined in general terms. It is a function mainly of perspective, interest and consumer of the evaluation exercise. While, for instance, from a financial perspective, efficiency is the most important quality characteristic, an administrative perspective will rather focus on inter-operability and usability. Both the consumer and, in particular, the interest behind the evaluation exercise determine the view on quality that is taken during evaluation. The glass box view on quality distinguishes between three dimensions, i.e. (i) data, (ii) system - interface, and (iii) system - function, while the black box view only distinguishes between the data and system dimensions, without considering whether a certain result goes back to the system's interface or its functions. For instance, while a developer will be interested in the more differentiated glass box view on quality, the funder of a project will be satisfied with the less complex black box view. In practical terms, it will be enough for the funder of a project to learn that the quality characteristic of understandability was judged low in a termbank because, e.g., the definitions were not understandable (data dimension) and the system as such was difficult to grasp (system dimension), whereas the developer additionally needs to know whether it was the windowing sequence that caused problems (system - interface dimension), or whether the problems rather stem from the implementation of too complex operations (system - function dimension). The type of metrics, i.e. whether quantitative or qualitative, depends to a great extent on the perspective and, above all, the consumer of the evaluation. Even for the same quality characteristic, e.g.
reliability, a scientist is likely to perform mathematical stress and volume tests, leading to quantitative values, while an evaluation aiming at the final user of the system might rather rely on qualitative questionnaires. For details on stress and volume tests see Black Box Testing - without User Involvement.
In addition to the four factors determining the setup of an evaluation exercise - motivation, system, environment and quality - the three types of evaluation as defined during the Edinburgh evaluation workshop of 1992 (Thompson92) give some more general insights into the overall test bed and are therefore useful to note in the test descriptions. If, for instance, a software system performs badly in the β-version and one has to find out why, the type of evaluation performed is diagnostic evaluation. If one has to find out to what extent a new version of a software system performs better than the old version, the type of evaluation is progress evaluation. Finally, if the aim is to find out whether the software system is suitable for a particular environment, the type of evaluation performed is adequacy evaluation. It must, however, be noted that the three types of evaluation, though conceptually clear, can occur in any combination during a practical evaluation exercise.
The final evaluation description document covers all of the above aspects, including a short description of test type, instruments and data used. Copies of the actual test instruments and data used for the tests have to be added as a separate appendix.
The major usage of test problem reports is in the framework of software development projects rather than with off-the-shelf products. Thus they are mostly relevant for diagnostic and progress evaluation rather than for those adequacy evaluation environments in which no feedback between evaluator and developer is planned or possible.
Test problem reports provide developers with a detailed description of the problems that occur during testing. They are very important instruments that aim at the improvement of software under development. The most important part of test problem reports is the detailed description of the problem and of the actions that led to the problem (Thaller93, 123), (Deutsch82, 289) and (Hoge92, B-3/1--B-3/7). In diagnostic evaluation environments a diagnosis of the failure is given and the action required is described, if possible (Thaller93, 123) and (Deutsch82, 289). For calculations such as MTTF (mean time to failure) or MTBF (mean time between failures) it is necessary to record the exact time when a failure occurred. Another important aspect that needs to be noted in test problem reports is the priority ID of the failure. Failure Priority ID Score shows a representative failure priority scale presented by Deutsch (Deutsch82, 289).
|PRIORITY ID|ACTION|
|1|fix immediately - catastrophic error, test cannot proceed|
|2|fix before test completion - serious error, severe degradation in performance, but test process can continue|
|3|fix before system acceptance - moderate error, specification can be met|
|4|fix by a specific date or event|
|5|hold for later disposition|
|T|nonrepeatable occurrence - problem will be tracked for reoccurrence|
|X|new problem - problem assumed to be serious but insufficient data available for analysis, investigation required|
In general test problem reports are an important part of the auditing of software projects and therefore have to take up every aspect that is needed for correct auditing and time-management (Thaller93, 123).
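The failure timestamps and priority IDs discussed above can be sketched as a small record structure. All names and data below are hypothetical illustrations; the MTBF value is simply the mean gap between the recorded failure times.

```python
from dataclasses import dataclass
from datetime import datetime

# Deutsch's priority scale, as reproduced in the table above.
PRIORITY_SCALE = {
    "1": "fix immediately",
    "2": "fix before test completion",
    "3": "fix before system acceptance",
    "4": "fix by a specific date or event",
    "5": "hold for later disposition",
    "T": "nonrepeatable occurrence",
    "X": "new problem, investigation required",
}

@dataclass
class ProblemReport:
    time: datetime     # exact time the failure occurred
    priority: str      # key into PRIORITY_SCALE
    description: str   # problem and the actions that led to it

def mtbf_hours(reports):
    """Mean time between failures, computed from the recorded
    failure timestamps (requires at least two reports)."""
    times = sorted(r.time for r in reports)
    gaps = [(b - a).total_seconds() / 3600
            for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

# Invented example data: three failures, 4 h and 20 h apart.
reports = [
    ProblemReport(datetime(1994, 3, 1, 9, 0), "2", "crash on save"),
    ProblemReport(datetime(1994, 3, 1, 13, 0), "3", "garbled umlauts"),
    ProblemReport(datetime(1994, 3, 2, 9, 0), "5", "slow lookup"),
]
print(mtbf_hours(reports))  # 12.0 hours between failures on average
```

The sketch makes concrete why the exact failure time must be put down in the report: without it, MTTF and MTBF cannot be computed at all.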
While for feature inspection the results are reported directly on the feature checklists used during inspection, result reports integrate all results achieved by means of scenario tests and systematic testing. They cover all details on the metrics applied and the observations made during the different testing exercises and are normally provided as an appendix to the overall test documentation. On the contents of result reports see also (Oppermann88, 8) and (Murine83, 374). They are a means that allows interested parties to look up the detailed results of the tests.
Testing experience in the TWB projects led to an optimised design of a result report that can be used for documenting any test environment that includes scenario and systematic testing (Hoge93b, 71-130). Depending on the type of evaluation performed, i.e. whether progress, diagnostic or adequacy evaluation, the result reports need to be more or less comprehensive.
|TYPE OF INFORMATION|ITEM OF INFORMATION|TYPE OF EVALUATION|
|administrative|administrative details|all|
|evaluation|function under discussion|adequacy, diagnostic, progress|
|evaluation|observation made / metric applied|adequacy, diagnostic, progress|
|evaluation|description of results|adequacy, diagnostic, progress|
|evaluation|related quality characteristic|adequacy, diagnostic, progress|
|evaluation|dimension of quality|diagnostic, progress|
|evaluation|proposal for improvement|diagnostic, progress|
|evaluation|importance of improvement|diagnostic, progress|
|evaluation|developer reaction|diagnostic, progress|
|evaluation|deadline for implementation|diagnostic, progress|
While administrative information is relevant for all types of evaluation, adequacy evaluation requires a minimum of evaluation information, including the function that is discussed, the observation made or metric applied, the description of the results of the observation or metric, and, finally, the related quality characteristic. In addition, in diagnostic and progress evaluation it is important to note to which dimension of the system the result relates, i.e. data, system-function or system-interface; to give, if possible, a proposal for the improvement of the software; to state how important it is for the evaluation side that the improvement be taken up; to record how the developers reacted to the results of the test and the proposals for improvement; and, finally, to note by when the developers plan to implement the changes.
In case the result report is used in the framework of software development projects, additional information for the auditing and time-management of the projects is necessary, i.e. whether implementation deadlines were met or not, and if not why not.
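A result-report entry of the kind described above can be sketched as follows; the field names are assumptions derived from the items of information listed, not a prescribed format. Fields required only for diagnostic and progress evaluation are optional.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResultEntry:
    # minimum required for adequacy evaluation
    function: str                     # function under discussion
    observation: str                  # observation made or metric applied
    result: str                       # description of the result
    quality_characteristic: str       # related quality characteristic
    # additionally required for diagnostic and progress evaluation
    dimension: Optional[str] = None   # data / system-function / system-interface
    proposal: Optional[str] = None    # proposal for improvement
    importance: Optional[str] = None  # importance of the improvement
    developer_reaction: Optional[str] = None
    deadline: Optional[str] = None    # planned implementation date

def complete_for(entry, evaluation_type):
    """Check that an entry carries all items required for the given
    type of evaluation (adequacy, diagnostic or progress)."""
    if evaluation_type == "adequacy":
        return True  # the mandatory fields have no defaults
    extras = (entry.dimension, entry.proposal, entry.importance,
              entry.developer_reaction, entry.deadline)
    return all(x is not None for x in extras)

# Invented example: sufficient for adequacy, not for diagnostic evaluation.
entry = ResultEntry("fuzzy match retrieval", "match quality inspection",
                    "nearest match not always offered", "reliability")
assert complete_for(entry, "adequacy")
assert not complete_for(entry, "diagnostic")
```

Such a check mirrors the rule stated above: diagnostic and progress evaluation demand the fuller record, adequacy evaluation only the minimum.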
Assessment reports are top-level reporting instruments that are based on the detailed data provided in the appendices, i.e. result reports, test descriptions, and test problem reports. The testing of a software system leads to a great number of individual results that are documented mainly on the metric level. In order to gain a picture of the overall software performance, it is necessary to proceed from the specific result on the metric level, via the respective sub-characteristic, to a general statement on the top level of quality reporting, i.e. evaluating the system's performance in terms of functionality, reliability, usability, efficiency, maintainability, and portability. Cf. (ISO91a).
(Murine83, 374) points out that ``All observed and measured data is recorded at the metric level; scores at the criteria and factor level are generated therefrom.'' Evaluation of test results implies the comparison of the actual software performance with the pre-defined target quality standard (Hoge91a, 2) and (Schmied89, 6-9). Thus, the results on the metric level have to be considered in the light of the stated or implied requirements of the respective user group. However, in user-oriented testing it is not always possible to define the target quality in exact terms. This is mostly due to two facts: (i) in user-oriented testing there is an obvious lack of quantitatively measurable metrics, and (ii) the capability of individual functions cannot always be anticipated by a testing team, which normally does not include system developers. In the NLP area, for instance, it is often not clear what systems such as translation memories or machine translation actually can do, and what the best solution could look like.
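The bottom-up generation of scores described above can be sketched as follows, assuming a simple equal-weight average at each level; the characteristic, sub-characteristics and metric values (percentage scores) are invented for illustration.

```python
# Observed values live only at the metric level (the leaves); scores at
# the sub-characteristic and characteristic level are generated from them.
scores = {
    "usability": {
        "understandability": {"menu comprehension": 80,
                              "manual readability": 60},
        "learnability":      {"time to first task": 70},
    },
}

def aggregate(tree):
    """Average nested metric scores up to the characteristic level."""
    if isinstance(tree, dict):
        values = [aggregate(v) for v in tree.values()]
        return sum(values) / len(values)
    return tree

print(aggregate(scores["usability"]))  # 70.0
```

Equal weighting is only one possible assumption; in practice the weights would follow the stated or implied requirements of the respective user group, which, as noted above, cannot always be defined in exact terms.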
Also, for each individual evaluation scenario, a different focus is placed on the quality characteristics. If, for instance, a tool for object-oriented programming is being evaluated, there is a clear focus on functionality, maintainability and portability, while for software developers as the target user group of the tool, usability is of minor importance. Or, if a remote batch machine translation program is being evaluated, its functionality is much more important than its usability, since there is only little interaction between system and users.
Another important aspect that has to be considered when evaluating the final performance of a software system, particularly of NLP applications, is the dimension of the system that is being considered. While for the black box view on quality, which is mainly used for adequacy evaluation, it is only relevant to distinguish between the system and data dimensions, diagnostic evaluation asks for a glass box view on quality which, on the system side, further distinguishes between the function and interface dimensions. The reliability of a translation memory, for instance, needs to be considered on the data dimension, i.e. whether the system actually recovers the nearest translation, whether it provides data on the originator and the date of the translation etc., and on the system dimension, i.e. whether there are frequent system breakdowns etc. It may well be that a translation memory is very reliable on the system dimension but, on the data dimension, does not provide the originator of the translation and fails to offer the closest translation proposal.