Though ``the need to test systems in real work environments is receiving increased attention'' (Karat90, 352), there has been hardly any methodological attempt to define the exact nature of such tests, which have so far mainly been the domain of software developers.
The term ``scenario'' entered software evaluation in the early 1990s (Lewis90, 337). A scenario test is a test case which aims at a realistic user background for the evaluation of software, as defined and performed, for instance, in the TWB projects (Hoge91a, 12), (Hoge91b, 10), (Ahmad93a, 322), (Hoge93, 166) and (LeHong92, 29), and later also adopted by the EAGLES evaluation group. It is an instance of black box testing whose major objective is to assess the suitability of a software product for every-day routines. In short, it involves putting the system to its intended use by its envisaged type of user, performing a standardised task.
One may roughly distinguish between five major phases of scenario testing, i.e. test planning, test preparation, testing, data analysis and reporting. During each of these phases a number of problems have to be tackled by the testing team before the next phase can start. The figures Central Problems of the Test Planning Phase, Central Problems of the Test Preparation Phase, Central Problems of the Testing Phase, Central Problems of the Data Analysis Phase, and Central Problems of the Reporting Phase briefly outline the major problems of each phase and the corresponding tasks.
Central Problems of the Test Planning Phase:
|costs||define evaluation project budget|
|software||decide whether system test or module test|
|type of scenario test||decide whether field or laboratory test|
|location||decide where; define technical environment|
|users||decide how many users, who and why|
|evaluators||decide how many evaluators and who|
|quality requirements||study user requirements; select relevant quality characteristics; define metrics and target values|

Central Problems of the Test Preparation Phase:
|test task||define task and sub-tasks|
|test data||decide whether test corpus, collection or suite; elaborate test data|
|instruments||decide which instruments (e.g. test programs); elaborate instruments integrating metrics|
|test procedure||define tasks of evaluators; define steps of testing exercise|
|duration||define start/end of each step; elaborate time schedule|
|test plan||elaborate test plan fixing the abovementioned aspects|

Central Problems of the Testing Phase:
|time management||make sure that the time schedule is kept|
|organisation||distribute tasks among users and evaluators|
|observation||note aspects of user behaviour|
|trouble shooting||react to problems; document deviations from test plan|

Central Problems of the Data Analysis Phase:
|data viewing||analyse data and apply relevant metrics|
|data collection||consider results of relevant metrics of all subjects|
|calculation||define value types for all metrics; calculate statistical averages; calculate statistical variance|

Central Problems of the Reporting Phase:
|documentation||document test-bed precisely; document all decisions taken during all phases; document all deviations from test plan; provide total of testing data as appendix|
|evaluation||justify all interpretations of results|
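The calculation tasks of the data analysis phase (defining value types, computing statistical averages and variance across subjects) can be sketched as follows. This is an illustrative sketch only, not part of the original methodology; the metric name and all values are invented:

```python
# Illustrative sketch (not from the source) of the calculation step in
# the data analysis phase: for each metric, collect the per-subject
# results and compute the statistical average and variance.
from statistics import mean, pvariance

def aggregate_metric(values):
    """Aggregate one metric across all subjects."""
    return {"average": mean(values), "variance": pvariance(values)}

# Hypothetical per-subject results for the metric "time on task" (minutes)
time_on_task = [12.5, 15.0, 11.0, 14.5, 13.0]
summary = aggregate_metric(time_on_task)
print(summary)
```

The same aggregation would be repeated for every metric defined in the test plan, so that individual (subjective) results only enter the report in statistically interpreted form.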
It is of utmost importance to note that tests involving users pose the most difficult problems in the assessment phase, because the large number of personal variables such as education, experience, age, motivation, time of day, work-load etc. is likely to blur the objectivity of test results, while at the same time such tests rank high with regard to informativeness.
Two different ways of performing scenario tests are reported in the software engineering literature, i.e. field and laboratory tests (Karat90, 352), (Crellin90, 330) and (Oppermann88, 13), which involve different testing environments, tasks, requirements on the test system, user participation, instruments, testing expertise and, last but not least, time and money constraints.
A field test is a type of scenario test in which the testing environment is the normal working place of the user, who is observed by one or more evaluators taking notes, recording times etc. From a psychological point of view, the field test is considered the least obtrusive test in that it involves basically the same physical and social environment factors as normal work does (Karat90, 352) and (Oppermann88, 12). Among the physical environment factors which are likely to influence the behaviour of the user are the layout of the office space, crowding and noise level. The most important social environment factors are office atmosphere and the normal pace of work (people stopping by and requesting information etc.) (Karat90, 352). However, despite the advantage of displaying the every-day physical and social environment factors, a certain variance in behaviour can result from the psychological effects of being observed while working.
The task to be performed by different users during the field test should be standardised, which guarantees that every user will encounter the same kinds of problems and will have to perform similar operations to succeed (Moll88, 73). Ideally the overall test task fits well into the organisational routine of the user's every-day work and has been developed beforehand in consultation with a number of users from the same environment (Karat90, 352). An obvious advantage of field tests over laboratory tests is that the test task can include problems of data transfer between the test system and existing systems. To ease evaluation, the overall test task needs to be divided into sub-tasks, each identifying an operational unit of performance. For each sub-task the metrics of interest should be defined beforehand, so that the evaluator's attention is automatically focused on particular aspects of performance.
Closely related to the problem of the test task are the requirements on the system under testing. It is obvious that, if the test task is to be considered part of the daily organisational routine, the software system under testing needs to be in a highly operable condition. Thus field tests are most beneficial if the systems under testing are beta-versions of products to be launched in the near future, or off-the-shelf products. The more the system forces the task to deviate from the normal routine, the less informative are the results of the field tests.
For both kinds of scenario test it is important that a representative number of users participate in the tests. A great number of personal variables can have a decisive influence on the performance of the system: in all cases computer literacy, motivation or time of day (Oppermann88, 12), and for the more complex NLP applications also education, experience and expertise. The naturally subjective results arrived at by means of scenario tests need to be statistically interpreted and ``objectivized''. The organisational environment of field tests, which do not involve much extra expenditure for equipment etc., normally allows more users to participate than in laboratory tests of comparable cost.
The instruments commonly used in field tests range from the simple observation of users, noting their behaviour and interaction times on evaluation checklists (Karat90), (Hoge91a, 17), (Hoge92, 7-14) and (Ahmad93a, 14), to pre- and post-testing interviews (Moll88, 72), (Hoge91a, 17), (Hoge92, 7-14) and (Ahmad93a, 14), think-aloud protocols (Moll88, 74) and (VainioLarsson90, 325), and, last but not least, logfile recording, a facility for recording all keystrokes during the user interaction in a separate file (VainioLarsson90, 325). The choice of instruments depends on various factors such as time and money constraints, technical facilities, evaluation expertise etc. Due to the limited possibilities of retrospective data analysis in field tests, it is of utmost importance that the data gained with the aid of the different instruments (notes on user behaviour, interaction etc.) be analysed right after the test is finished, because otherwise important contextual information is likely to get lost. The evaluation setup of field tests generally puts heavy demands on the expertise and experience of the evaluator. Test planning is rather difficult for field tests, since the natural environment involves a great number of variables which cannot easily be determined beforehand. The optimal point in time needs to be found for integrating a new piece of software into the daily routines. The test preparation phase for field tests is more time-consuming than for its laboratory counterpart, since user behaviour needs to be anticipated to a certain extent in order to pre-define applicable metrics. Similarly, the testing phase is highly demanding in field tests, since it is very difficult to separate important from unimportant user behaviour during the test. Also, from the organisational point of view, a field test is only satisfactory if the normal working routine is interrupted as little as possible during the test.
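Logfile recording as mentioned above can be illustrated with a minimal sketch: every interaction event is written with a timestamp to a separate file. The class name, event format and file name are assumptions for illustration, not the interface of any real logging tool:

```python
# Minimal sketch of logfile recording: each keystroke or interaction
# event is appended as a timestamped line to a separate log file.
# Class name, event strings and file name are invented for this example.
import time

class InteractionLog:
    """Append timestamped interaction events to a log file."""

    def __init__(self, path):
        self.f = open(path, "a", encoding="utf-8")

    def record(self, event):
        # one line per keystroke or interaction event
        self.f.write(f"{time.time():.3f}\t{event}\n")
        self.f.flush()

    def close(self):
        self.f.close()

log = InteractionLog("session.log")
log.record("KEY a")
log.record("MENU File>Open")
log.close()
```

Such a log can later be replayed against the pre-defined metrics (interaction times, error counts etc.), which is exactly the kind of retrospective analysis that field tests, unlike laboratory tests, offer only to a limited extent.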
Experience, together with well-prepared evaluation instruments such as checklists, is among the most decisive means of guaranteeing that the results of field tests are satisfactory. Whereas laboratory tests provide the evaluator with various possibilities to record and replay the different test situations, evaluators in field tests mostly have to rely on what they identify as important information during the various situations in a test. This makes the data analysis phase of field tests particularly difficult: if it is not performed directly after the completion of the test, time is likely to blur the situational context in which results were achieved, and the probability of inadequate data interpretation is high.
The final costs of a scenario test can be calculated from two major types of cost, i.e. personnel and equipment. On the personnel side, expenses are calculated by means of the PH (person hour) rates for the evaluators and users involved in the overall test. As demonstrated above, field tests involve a great number of PH for evaluators, particularly in the test planning and preparation phases. Mainly due to the natural environment, the number of users participating in tests is normally higher than in laboratory tests. However, the time that has to be invested for each participant, including introduction, test and post-testing interview, is much less than in laboratory tests. The major difference in costs between field and laboratory tests, however, lies in the equipment: field tests involve comparatively low expenses, because almost no investment in additional technical evaluation instruments is required. Thus, given a certain expertise on the part of the evaluators, field tests mostly incur lower costs than their laboratory counterpart (Karat90, 355).
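The PH-based cost calculation described above can be sketched as follows; all rates, hour counts and equipment figures are hypothetical and serve only to illustrate the structure of the calculation:

```python
# Hedged sketch of the cost calculation: personnel costs from PH
# (person hour) rates for evaluators and users, plus equipment costs.
# All figures below are invented for illustration.
def scenario_test_cost(evaluator_ph, evaluator_rate,
                       user_ph, user_rate, equipment):
    personnel = evaluator_ph * evaluator_rate + user_ph * user_rate
    return personnel + equipment

# A field test: many evaluator PH, but almost no extra equipment
field_cost = scenario_test_cost(120, 60.0, 80, 40.0, 500.0)
# A laboratory test: expensive laboratory equipment dominates
lab_cost = scenario_test_cost(100, 60.0, 40, 40.0, 15000.0)
print(field_cost, lab_cost)
```

With invented figures like these, the equipment term dominates the difference between the two test types, mirroring the argument made in the text.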
A laboratory test is an instance of a scenario test in which the testing environment comprises a number of isolated users who perform a given task in a test laboratory offering a great variety of data collection techniques (Karat90, 352). When interpreting test results, it has to be kept in mind that ``the end user's social and physical environment is not replicated in the laboratory, and these factors influence the way an end user works with an application'' (Karat90, 352) and (Oppermann88, 13). On the other hand, laboratory tests allow the participation of system developers more easily than field tests, which gives a certain impetus to remedy problems in development projects (Karat90, 352). Like field tests, laboratory tests should cover standardised test tasks which allow the comparison of user behaviour. On the one hand, the isolated testing environment in a laboratory does not allow any assessment of data transfer routines and therefore has a rather modular character. For instance, in the case of testing a translator's workstation, a laboratory test cannot establish whether the terminology provided by the existing mainframe terminology management system at the customer's site can be integrated into the editing environment. On the other hand, since the task does not have to fit into the routines of every-day work, it is possible in laboratory tests to define less comprehensive test tasks, as is frequently needed for testing system prototypes. The nature of the test task, i.e. whether the task is representative or merely has testing character, also influences the user's behaviour during the test (Oppermann88, 13). Since there is much greater variability in the definition of the test task, laboratory tests are particularly useful if the system under testing is not fully operable.
Most of the bigger software houses, therefore, perform a number of laboratory tests at different stages of the software life-cycle rather than putting much effort into the design of comprehensive field tests.
In laboratory tests users need to be highly motivated in order to deliver useful test results, because subjects have to invest a certain amount of time to get used to the new working environment and the demands the tests put on them. Moreover, due to the extra time needed for travel as well as for the introduction of subjects, normally fewer users participate in laboratory tests than in field tests.
The artificial environment in laboratory tests allows the use of a great number of technical instruments. Well-equipped laboratories offer one-way mirrors, video and audio recording facilities as well as different logging programs. For details on logging programs, see Automatic test instruments for user-oriented testing; cf. also (Karat90, 351) and (Crellin90, 332). In combination with these technical instruments, checklists and questionnaires have proved useful for covering the maximum amount of relevant information (VainioLarsson90, 325) and (Karat90, 351). In laboratory tests the planning phase is comparatively easy, since the test does not have to fit into any working routine. Metrics defined prior to the test in the test preparation phase can be enriched with additional aspects of performance when the data is analysed retrospectively. The testing phase is also less complicated than in field tests, because time management, organisation and trouble shooting are not crucial for the successful completion of a laboratory test, which lays no claim to resembling normal work. Thanks to the various technical devices for data storage and retrieval, the data analysis phase becomes less difficult, but at the same time often much more time-intensive than in its field counterpart.
The costs of laboratory tests are reported to be around four times higher than those of comparable field tests (Karat90, 352). The major factor in this calculation is the very expensive maintenance of a laboratory with its various technical devices. Though evaluators in field tests invest more time in test planning and preparation, the person hours invested by evaluators in laboratory tests are comparable, since the data analysis phase is much more time-intensive than in field tests. This is mainly because the more technical instruments are involved in data recording, the more data is available, and the correlation of results from different data storage devices is both difficult and time-intensive. Moreover, most of the data is not recorded during the actual testing, as in field tests, but rather during a great number of retrospective analysis procedures. On the user side, normally fewer PHs are invested in actual testing than in organisational tasks such as the selection of subjects, travel and introduction into the new environment.
The table Field Test - Laboratory Test - A Comparison summarises the major differences between the two types of scenario test discussed above.
| ||FIELD TEST||LABORATORY TEST|
|testing environment||normal working place: least (but still slightly) obtrusive; same physical/social environment factors||laboratory: controlled; new working environment; integration of developers into tests possible|
|test task||representative integrated task: fits into every-day routine; includes problems of data transfer||individual tasks: possible to test specific modules only|
|test system required||operable system or beta version||prototypes or operable systems|
|users||more users per budget||fewer users per budget|
|instruments||direct observation: think-aloud, checklisting, pre- and/or post-testing interviews||indirect observation: one-way mirrors, video recording, audio recording|
|test preparation||time-intensive||less time-intensive|
|test planning phase (approx. share of total PH)||10||5|
|costs of test types compared||25||100|
Concluding the discussion of types of scenario test, it is important to note that experience shows it is often very useful to combine the two types, as Karat reports: ``By combining the use of field and laboratory tests and testing prototype and integrated code, usability staff were able to collect complementary information that together provided a more complete understanding of the end user, work context, and panel issues and resulted in a better final interface design than would have been possible without iterative testing using different methodologies'' (Karat90, 352).
Of all test types it is the scenario test that can provide the most detailed information on the quality subcharacteristics understandability, learnability and operability, subsumed under usability (ISO91a). Additionally, scenario tests can provide information on suitability, accuracy and interoperability (subcharacteristics of functionality), time behaviour and resource behaviour (subcharacteristics of efficiency), changeability (a subcharacteristic of maintainability) and adaptability (a subcharacteristic of portability).
When elaborating metrics for scenario tests, a top-down approach is most appropriate. The evaluator starts with a top-level item of the requirements specification and considers each of the different quality characteristics and sub-characteristics as provided by ISO 9126. For each characteristic identified as important for the particular application, the evaluator needs to describe how the system should ideally perform with regard to the system and data dimensions. Having fixed the expected performance of the system, metrics have to be found which allow the measurement of system performance. Note that, since scenario tests involve subjective users, most metrics are only indirectly quantifiable, i.e. only by involving a number of users can the results, which are in the first instance subjective qualitative data, be statistically objectivised and quantified. Metrics typically reported to be applied in scenario tests are time on task (Karat90, 353), (Lewis90, 338) and (Hoge93b, 10), completion rate, error-free rate, time needed for the training programme (Hoge93b, 10), frequency of help/documentation use, etc. (Hoge93b, 10).
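The metrics named above (time on task, completion rate, error-free rate) can be sketched as simple computations over per-user session records. The record structure and all values are invented for illustration, not taken from the cited literature:

```python
# Sketch of the metrics named above, computed over hypothetical
# per-user session records; the field names are invented for
# illustration.
def completion_rate(sessions):
    return sum(s["completed"] for s in sessions) / len(sessions)

def error_free_rate(sessions):
    return sum(s["errors"] == 0 for s in sessions) / len(sessions)

def mean_time_on_task(sessions):
    done = [s["minutes"] for s in sessions if s["completed"]]
    return sum(done) / len(done)

sessions = [
    {"completed": True,  "errors": 0, "minutes": 12.0},
    {"completed": True,  "errors": 2, "minutes": 18.0},
    {"completed": False, "errors": 5, "minutes": 30.0},
    {"completed": True,  "errors": 0, "minutes": 15.0},
]
print(completion_rate(sessions))  # → 0.75
```

Each of these values is only indirectly quantified in the sense described above: a single session record is subjective, and only the aggregation over a number of users yields a statistically interpretable result.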
The results of scenario tests mainly need to be considered in the light of objectivity. As pointed out before, the results of neither field nor laboratory tests can be considered objective, since both rely on the participation of subjective individuals, and a single person cannot be considered objective. The most common techniques to reduce subjectivity in scenario tests are therefore to calculate averages and variances over a sufficiently large number of subjective judgements, while trying to avoid interference from other systems. However, striving for a ``cleanroom'' approach to scenario tests by selecting test persons and subjects without interference from other systems is dangerous, since, while more objectivity is achieved, the results are likely to become less representative.