The term systematic testing subsumes all testing activities that examine the behaviour of software under specific conditions with particular results expected. The term ``systematic testing'' was coined in (Hoge93b, 10). Whereas scenario testing requires users to be integrated into the testing exercise, systematic tests can be performed solely by software engineers and/or user representatives. Survey of Glass and Black Box Testing Techniques describes a number of objectives behind testing that need further elaboration in order to satisfy the needs of user-oriented testing. There are three objectives that are particularly relevant for user-oriented testing, i.e.
Accordingly, user-oriented systematic testing will be split up into task-oriented testing, menu-oriented testing, and benchmark testing.
Task-oriented testing is performed to examine whether a piece of software actually fulfils pre-defined tasks. These tasks may either be stated in the requirements specification document, as in software development projects, or be implied by third parties, as for instance in consumer reports. Task-oriented testing is related to scenario testing in that its major purpose is to assess the overall functionality of the system by means of relevant data inputs, as well as to examine the quality of the data output. There are several reasons why one may opt for task-oriented testing instead of performing scenario tests, i.e.
As with scenario testing, a top-down approach is most appropriate for task-oriented testing. The success of task-oriented testing ultimately depends on the exhaustive definition of a representative number of test tasks and subtasks as they are likely to occur in everyday routines and are covered by the system. For this purpose, the evaluator needs to consult a number of users and discuss the technical and organisational constraints of their everyday work as well as the type of task they have to perform with the aid of the software. Once task-oriented testing has been performed, it is important that the data output be examined thoroughly and discussed with the end users of the system.
Task-oriented testing can be carried out during the software development process at any stage of the software life-cycle as well as with any off-the-shelf software product. Generally, task-oriented testing does not require extensive test planning. Since task-oriented testing does not involve users as subjects (users should still be involved in the definition of the test task), the overall organisation of the test is less demanding. The testing environment is normally the workplace of the evaluator and is in principle not relevant for the interpretation of results.
The major effort of the test preparation phase lies in the definition of the test tasks. Whereas in scenario tests the test task needs to be standardised, task-oriented testing can define a broad range of test tasks, each of which may be relevant to one user or another. While in scenario testing the sub-tasks have to be defined beforehand, task-oriented testing leaves more room for investigating the possible ways of performing a given task. In field tests, for instance, problems in the performance of the test task lead to an interruption and therefore to a rather costly failure of the whole testing exercise; task-oriented testing, by contrast, allows test tasks to be repeated while the problems encountered are documented. In addition to the metrics defined in the test preparation phase, further metrics can often be applied or additional data tested once the detailed functionality of the system is accessible. The testing instruments applied during task-oriented testing are restricted to checklists containing the tasks, sub-tasks and related metrics, and to reporting instruments such as result reports (cf. result sheets in (Hoge93b, 10), or metric-level tally sheets in (Murine83, 374)) and test problem reports (Deutsch82, 290) (cf. failure sheets in (Hoge92, B-3/1-B-3/7)).
Having prepared the test tasks, data and metrics in the preparation phase, the actual testing phase is not very demanding and can be performed by anyone who has some knowledge of evaluation and the definition of metrics, i.e. by software developers, users or management.
The costs of task-oriented testing are comparatively small, and the person-hours (PH) invested depend mainly on the number of tasks tested. Apart from the technical environment of the evaluator (hard- and software), no extra investment in testing equipment or instruments is necessary for task-oriented testing.
The primary quality characteristic under investigation in task-oriented testing is functionality. In this sense task-oriented testing comes close to what is called ``functionality testing'' in Black Box Testing - without User Involvement, i.e. investigating whether the program does what it is supposed to do. Being able to test a great number of different tasks, the suitability of the software, i.e. the presence and appropriateness of a set of functions for specified tasks, can be closely examined (ISO91a, A.2.1.1). Another important aspect of task-oriented testing is the examination of the quality of data output as it is described under accuracy, the ``attributes of software that bear on the provision of right or agreed results or effects.'' (ISO91a, A.2.1.2). When performed at the final installation place of the software, task-oriented testing can also deliver valuable results concerning the interoperability of the software. For this purpose, the tasks must cover the communication with other applications and/or users within the given environment. In addition the usability of the system can be assessed in a different way than in scenario testing: one of the most frequently applied metrics of operability in task-oriented testing, for instance, is the counting of steps necessary to perform a certain task (how many actions are required to perform a given subtask).
The metrics applied in task-oriented testing deliver mostly boolean (presence or absence of functions), quantitative (e.g. number of steps etc.) or classificatory (how well a function performs a task) values. While boolean and numerical values are objective, classificatory values are based on the subjective impression of the evaluator. In scenario tests the originally subjective statements of users can be ``objectivised'' by involving a representative number of users in the tests; task-oriented testing, by contrast, is normally performed by only one (or, better, two) evaluators and thus has to be seen in the light of subjectivity. While the boolean values that determine the presence or absence of functions can be considered highly reliable, the classificatory values that denote the quality of the implementation of the individual functions are not necessarily repeatable when a different evaluator is involved and can therefore not be considered particularly reliable.
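The distinction between the three value types can be illustrated with a small result sheet. The following is only a minimal sketch: the class names, the sub-tasks and the recorded values are invented for illustration and are not taken from the result sheets cited above. It records one value per metric and counts how many entries are objective (boolean or quantitative) and how many rest on the evaluator's judgement (classificatory).

```python
from dataclasses import dataclass, field
from typing import Union

# Hypothetical labels mirroring the three value types named in the text.
BOOLEAN, QUANTITATIVE, CLASSIFICATORY = "boolean", "quantitative", "classificatory"


@dataclass
class MetricResult:
    metric: str
    kind: str
    value: Union[bool, int, str]

    @property
    def objective(self) -> bool:
        # Boolean and numerical values are treated as objective;
        # classificatory values depend on the evaluator's impression.
        return self.kind in (BOOLEAN, QUANTITATIVE)


@dataclass
class SubTask:
    name: str
    results: list = field(default_factory=list)


def summarise(subtasks):
    """Count objective and subjective entries across all sub-tasks."""
    objective = sum(r.objective for s in subtasks for r in s.results)
    total = sum(len(s.results) for s in subtasks)
    return {"objective": objective, "subjective": total - objective}


# Invented example sheet for two sub-tasks of a termbank test.
sheet = [
    SubTask("import term list", [
        MetricResult("import function present", BOOLEAN, True),
        MetricResult("steps needed", QUANTITATIVE, 4),
        MetricResult("quality of imported entries", CLASSIFICATORY, "good"),
    ]),
    SubTask("export term list", [
        MetricResult("export function present", BOOLEAN, False),
    ]),
]

print(summarise(sheet))  # {'objective': 3, 'subjective': 1}
```

Keeping the value type next to each value makes it easy to report separately which results a second evaluator would be expected to reproduce.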
While both scenario and task-oriented testing are mainly geared to examine the handling and functionality of the software, the philosophy behind menu-oriented testing comes closest to the developer's principal aim, i.e. the discovery of software problems. The basic idea behind menu-oriented testing, i.e. the ``testing of each program feature or function in sequence'' (Musa87, 521), is prominent in many glass and black box testing techniques. As in the structured walkthrough (cf. Static Analysis Techniques), the software is examined from top to toe, considering each individual function as it is sequentially offered in the menu bar. Instead of walking through the code, during menu-oriented testing, the evaluator ``walks through'' the executables, following each possible path of program execution. Cf. path testing in section B.3.1.2 Dynamic Analysis Techniques. Thus, while in both scenario and task-oriented testing only particular functions are performed, namely those that are necessary to perform the test tasks, in menu-oriented testing each function of the software is executed at least once.
Similar to task-oriented testing, menu-oriented testing can be performed at any stage of the software life-cycle as well as with off-the-shelf products. For menu-oriented testing little time has to be invested in the test planning and test preparation phases. No users have to be found and the test does not have to fit into any operational environment. It can easily be performed in isolation on the evaluator's computer, which, however, reduces the number of metrics that can be applied (neglecting for instance aspects of interoperability).
Compared to the extensive test preparation phase of both scenario and task-oriented testing, the activities of the test preparation phase in menu-oriented testing are reduced to a minimum. No specific test task needs to be defined beforehand and, therefore, the time-consuming consultation of users is not necessary. Instead of following pre-defined tasks, the evaluator investigates any possible way of handling the system. Since the test is not performed along a certain pre-defined task, only a few principal metrics can be defined in advance. The preparation of the instruments necessary for menu-oriented testing - mainly result reports and software trouble reports - is neither difficult nor time-consuming.
The major effort in menu-oriented testing clearly has to be spent during the testing phase: apart from applying pre-defined metrics, the evaluator may have to develop metrics, elaborate test data, perform tests and document the results on an ad hoc basis while executing the software. Only when executing a certain function can the evaluator guess which data is needed to perform certain operations with the functions offered in the menu bar (the WHAT HAPPENS IF ... test). In the case of a terminology elicitation system, for instance, a volume test can be performed by executing the option ``do concordance'' with an unusually large file. Cf. (Hohmann94, 71), where the result of this type of volume test with the terminology elicitation system ``System Quirk'' is documented. For a translation memory a stress test could be performed in which, for instance, a certain number of parallel texts are accessed by different users on the LAN at the same time. Recovery tests could be performed with a termbank by simulating a system breakdown (e.g. on PCs by pressing the control/alt/del keys) before and after terminology modifications have been properly saved. It is obvious that, compared to task-oriented testing, successful menu-oriented testing presupposes a good deal of both evaluation expertise and experience.
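A volume test of the kind described here can be sketched as a loop that feeds a function inputs of growing size and records whether it still completes correctly. This is an illustrative sketch only: the `concordance` function below is a hypothetical stand-in written for this example, not the ``do concordance'' option of any real system, and the input sizes are arbitrary.

```python
import time
from collections import defaultdict


def concordance(tokens, width=2):
    """Hypothetical stand-in for the function under test: a keyword-in-context index."""
    index = defaultdict(list)
    for i, tok in enumerate(tokens):
        left = tokens[max(0, i - width):i]
        right = tokens[i + 1:i + 1 + width]
        index[tok].append((left, right))
    return index


def volume_test(func, sizes, vocabulary=("term", "bank", "entry", "text")):
    """Run func on inputs of growing size; record outcome and elapsed time."""
    report = []
    for n in sizes:
        tokens = [vocabulary[i % len(vocabulary)] for i in range(n)]
        start = time.perf_counter()
        try:
            result = func(tokens)
            # Boolean check: every input token must be indexed exactly once.
            passed = sum(len(v) for v in result.values()) == n
        except (MemoryError, RecursionError):
            passed = False
        report.append({"size": n, "passed": passed,
                       "seconds": time.perf_counter() - start})
    return report


for row in volume_test(concordance, sizes=[1_000, 10_000, 100_000]):
    print(row["size"], row["passed"], round(row["seconds"], 3))
```

The pass/fail column is a boolean result and the timing column a numerical one, so both can be repeated by another evaluator as long as the input data and the environment are kept constant.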
The costs of menu-oriented testing mainly lie in the recruitment of excellent evaluation personnel capable of generating metrics and data ad hoc. As with task-oriented testing, no investment is necessary for additional testing instruments.
In menu-oriented testing nearly all user-related quality characteristics can be assessed. While suitability, understandability, operability and learnability should also be assessed by means of scenario or task-oriented testing, the major benefit of menu-oriented testing lies in delivering results on the system's reliability, including characteristics such as maturity, fault tolerance or recoverability. On the functionality side, menu-oriented testing delivers valuable results concerning the system's compliance with other standards, e.g. whether a Windows application consistently makes use of the same Windows system messages as other applications do, or whether the system is internally consistent in its labelling of functions and processes etc. System security is also one of the characteristics that only menu-oriented testing can sufficiently investigate, since, unlike in task-oriented and scenario testing, it is certain that every function is performed and checked.
The results achieved by means of menu-oriented testing are mostly qualitative and descriptive in the sense that any observation related to the different quality characteristics is directly noted on the result report. If pre-defined metrics are applied, or tests such as volume, stress or recovery tests are performed, the results are either boolean, classificatory or, in rare cases, numerical. For qualitative statements, the objectivity is comparatively low, since menu-oriented testing is performed by individuals. Boolean and quantitative results delivered by means of volume, stress or recovery tests can be considered objective, since other evaluators are likely to achieve the same results when keeping data input and environment of the system constant.
Benchmark tests examine the performance of systems. The notion of performance can be applied to individual functions, to modules or to the overall system. In the strict technical sense, a benchmark test is the measurement of system performance without being dependent on personal variables (Thaller94, 146). Thus, following the narrow definition of benchmark, there are very few possibilities of applying benchmarks on the module or even system level of interactive systems. Different views on the definition of benchmark can be found in (Lewis90, 337) and (Oppermann88, 12). Examples of benchmark tests in the NLP area are rather on the function level, e.g. the measurement of success rates for automatic terminology retrieval functions, the measurement of translation retrieval rates for translation memories, the measurement of time for the parsing of a text etc. Benchmark tests allow the comparison of the performance of different tools. When the same benchmark is performed with different systems, both system parameters and environment variables must be kept constant. The comparison of the benchmark results only makes sense if different translation memories, for instance, have access to the very same background material and are tested with the very same test text.
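The requirement that background material and test text be identical for all systems under comparison can be made concrete with a small retrieval-rate benchmark. The sketch below rests on assumptions made purely for illustration: the two ``translation memories'' are toy lookup functions, the segments are invented, and the retrieval rate is defined here simply as the share of test segments for which a translation is returned.

```python
def retrieval_rate(lookup, test_segments):
    """Share of test segments for which the system returns a translation."""
    if not test_segments:
        raise ValueError("test text must contain at least one segment")
    hits = sum(1 for seg in test_segments if lookup(seg) is not None)
    return hits / len(test_segments)


# Identical background material for both systems (invented data).
background = {
    "open the file": "Datei öffnen",
    "save the file": "Datei speichern",
    "close the window": "Fenster schließen",
}


def exact_match_tm(segment):
    """Hypothetical system A: exact matches only."""
    return background.get(segment)


def normalising_tm(segment):
    """Hypothetical system B: ignores case and surrounding whitespace."""
    return background.get(segment.strip().lower())


# Identical test text for both systems.
test_text = ["open the file", "Save the file", "close the window ", "print the file"]

for name, tm in [("A", exact_match_tm), ("B", normalising_tm)]:
    print(name, retrieval_rate(tm, test_text))
# A 0.25
# B 0.75
```

Because both functions read the same `background` and are run on the same `test_text`, the difference between the two rates can be attributed to the matching behaviour alone.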
For benchmark tests, the test preparation phase is most decisive, since it involves the identification of appropriate units or functions which operate without being influenced by the evaluator, the selection of test data, and the specification of the measurement technique. The testing environment is a decisive factor influencing the result of the benchmark and thus needs to be documented carefully (cf. Test Descriptions). Typical instruments applied for benchmark tests are checklists that cover the quality characteristic, the benchmark, the measurement technique and the results. If a benchmark involves the execution of more than one function, it is useful to program shell scripts that contain the sequence of function calls as they are performed for the benchmark test.
The major quality characteristic assessed by means of benchmark tests is efficiency. Apart from time and resource behaviour (ISO91a, A.2.4.1-A.2.4.2), an important quality characteristic that falls under efficiency is output behaviour. Particularly in the NLP area a system's capability to produce a certain amount of output in a given time is often measured in benchmark tests.
The types of result achieved by means of benchmark tests are mostly numbers, e.g. the time needed to perform a certain function, the resources needed when performing a function, and, last but not least, the amount of output data produced in a given time.
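The three kinds of numerical result named here (time, resources, output per unit of time) can be collected by a small measurement harness. The following is a sketch under stated assumptions, not a prescribed measurement technique: `tokenise` is a hypothetical stand-in for the function under test, peak memory is taken as the only resource measure, and the best of several repetitions is reported as the time. Note that memory tracing itself slows execution, so timings obtained this way are only comparable with each other.

```python
import time
import tracemalloc


def benchmark(func, data, repetitions=5):
    """Measure elapsed time, peak memory and output rate of func(data)."""
    timings = []
    peak = 0
    produced = 0
    for _ in range(repetitions):
        tracemalloc.start()
        start = time.perf_counter()
        output = func(data)
        timings.append(time.perf_counter() - start)
        _, run_peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peak = max(peak, run_peak)
        produced = len(output)
    best = min(timings)
    return {
        "best_seconds": best,                      # time behaviour
        "peak_bytes": peak,                        # resource behaviour
        "items_per_second": produced / best if best > 0 else float("inf"),
    }


def tokenise(text):
    """Hypothetical stand-in for the function under test."""
    return text.split()


# The test data is fixed so that repeated runs remain comparable.
test_text = "the parser reads the test text " * 10_000
result = benchmark(tokenise, test_text)
print(result)
```

The environment in which such a harness is run (hardware, operating system, interpreter version) would have to be documented together with the figures it produces.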