In the testing phase, there are two major kinds of testing instruments, i.e. those that require manual data collection and those that collect data automatically.
The following survey concentrates on the most important characteristics of the most prominent manual testing instruments - questionnaires, checklists, interviews, observations, think-aloud protocols - as gathered from a large number of practical test reports.
Questionnaires are frequently used in all phases of software development and evaluation (Oppermann88, 10). They are used to elicit both quantitative and qualitative data. For qualitative data, numerical rating scores are often used, which lead to a numerical result that appears ``objective'' at first sight (Crellin90, 333) and (Oppermann88, 10). The reliability of results obtained by means of rating scales strongly depends on the adequacy of the rating scale and on both the number and the representativeness of the persons questioned (Falkedahl91, 25). Another frequently cited problem of questionnaires is that they are likely to deliver only those results that are welcome to the designers of the questionnaire, i.e. the choice of questions biases the results. This is because the way a question is posed may implicitly suggest the ``correct'' answer (Oppermann88, 10). Thus both the validity and the informativeness of questionnaire results depend on the adequacy of the questions and rating scales (King90, 25).
Whereas paper questionnaires are still prominent in large-scale surveys, for testing purposes the interactive on-line questionnaire is steadily gaining importance (Rushinek85, 250) and (Crellin90, 333). Interactive on-line questionnaires are linked to the software: at critical points in the execution of the program, a question pops up which the subject has to answer directly. Because the questionnaire is linked to the program, it is impossible to use the software without answering the questions posed. The responses are then stored automatically, and statistical calculations are performed over the whole range of responses of the different subjects (Rushinek85, 251). On-line interactive questionnaires are a sophisticated means of eliciting information, particularly on user-related quality characteristics such as usability or functionality. This is because the questions are posed directly after the user has performed the critical function, i.e. at the very moment when the impression of the software on the user is at its peak. At the same time, interactive questionnaires disrupt the normal work flow of the user, and thus the findings relate to the usability and functionality of individual modules rather than to the overall system.
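The storage and statistics step described above can be illustrated with a minimal sketch. It is not taken from any of the cited tools: the class name `OnlineQuestionnaire`, the 0-4 rating range, the sample questions and the choice of mean and median are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median


class OnlineQuestionnaire:
    """Collects ratings requested right after a critical function (hypothetical design)."""

    def __init__(self, questions):
        # questions: assumed mapping from a function name to the question text
        self.questions = questions
        self.responses = defaultdict(list)  # function name -> list of (subject, rating)

    def ask(self, function_name, subject, rating):
        """Store one rating (assumed scale 0-4) for the question tied to a function."""
        if function_name not in self.questions:
            raise KeyError(f"no question linked to {function_name!r}")
        if not 0 <= rating <= 4:
            raise ValueError("rating must be between 0 and 4")
        self.responses[function_name].append((subject, rating))

    def summary(self):
        """Return simple per-function statistics over all subjects."""
        result = {}
        for function_name, entries in self.responses.items():
            ratings = [rating for _, rating in entries]
            result[function_name] = {
                "n": len(ratings),
                "mean": mean(ratings),
                "median": median(ratings),
            }
        return result


# Illustrative data only: two subjects, two critical functions.
questionnaire = OnlineQuestionnaire({
    "search": "How easy was it to find the entry?",
    "import": "How easy was it to import the file?",
})
questionnaire.ask("search", "subject-1", 3)
questionnaire.ask("search", "subject-2", 4)
questionnaire.ask("import", "subject-1", 1)
stats = questionnaire.summary()
print(stats["search"])
```

Because each rating is keyed by the function that triggered the question, the summary is per module, which matches the observation that such findings relate to individual modules rather than to the overall system.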
The elaboration of on-line interactive questionnaires from scratch is time-consuming and presupposes a certain knowledge of object-oriented programming. There are different shells that ease the elaboration of questionnaires, mostly including a system usage log, an interactive on-line questionnaire (plus rating scales) and data analysis tools. Setting up an on-line interactive questionnaire with the aid of these shells is reported to need little expertise, time and money. By combining different test instruments, a great amount of data can be collected. Once the system has been set up correctly, subjects can perform the test on their own, without an evaluator being present at all times. This reduces personnel costs and considerably eases the test planning phase, since no test dates have to be arranged: users can perform the tests whenever they like. However, a precondition for the successful application of this type of on-line interactive questionnaire is that users are clearly instructed, i.e. they need to understand exactly what they are supposed to do and how they are supposed to respond (Crellin90, 333). The analysis of the data obtained by means of the PROTEUS shell requires some experience with the elicitation technique. The system performs computer-aided cluster analysis and graphical display of the numerical ratings and textual construct labels used in the questionnaire.
Apart from on-line interactive questionnaires, debriefing questionnaires (which can be either paper or on-line) are still used in post-testing interviews after performing a scenario test (Karat90, 353), cf. also Interviews. They generally elicit how the performance of the software was judged, what changes need to be made to the software, what was confusing, what was particularly helpful etc. They are often used in combination with post-testing interviews, which are performed after the analysis of the questionnaires and which take up those aspects that need further clarification.
Checklists are frequently used for manual software tests, particularly for any kind of inspection. Generally the aim of using checklists is ``... to obtain a concise and coherent description of the system in terms of objects, attributes, functions, relations between objects as well as between objects and functions, dialogue states, selections and estimated usability'' (VainioLarsson90, 325).
Checklists can deliver numerical, boolean and classificatory values. There are various rating techniques applied in checklists, the most important of which are availability rating and performance rating (Athappily86, 15-21). The most frequently used rating technique in checklists is availability rating, for which only the boolean values yes/no (or true/false) are used. A more comprehensive rating scale, normally with 5-7 items, is used for performance rating. A frequently used performance rating scale covers the following 5 items (Athappily86, 15):
0 feature not available
1 feature partially available
2 feature limited or fair
3 feature complete or good
4 feature very good or superior
For quality characteristics such as usability, understandability, learnability, etc. the performance scale would be adjusted, i.e. instead of ``available'' the scale would take up ``usable'', ``understandable'' etc.
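The two rating techniques and the adjustment of the scale to other quality characteristics can be sketched as follows. The function names and the idea of a label template are assumptions for illustration; only the five scale labels come from the text above.

```python
# Labels of the 5-item performance scale; "{adj}" marks the word that is
# swapped when the scale is adapted to another quality characteristic.
PERFORMANCE_TEMPLATE = {
    0: "feature not {adj}",
    1: "feature partially {adj}",
    2: "feature limited or fair",
    3: "feature complete or good",
    4: "feature very good or superior",
}


def performance_scale(adjective="available"):
    """Return the 5-item scale with the adjective adapted (hypothetical helper)."""
    return {
        score: label.format(adj=adjective)
        for score, label in PERFORMANCE_TEMPLATE.items()
    }


def availability_rating(present):
    """Availability rating only knows the boolean values yes/no."""
    return "yes" if present else "no"


usability_scale = performance_scale("usable")
print(usability_scale[1])
print(availability_rating(True))
```

Keeping the scores fixed at 0-4 and changing only the wording means that ratings for different characteristics remain comparable on the same numerical scale.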
Ideally a checklist used for testing purposes should take up the metric that is being applied, the measurement technique, and, most importantly, the related quality characteristic. A sample checklist for the inspection of a Translation Memory that takes up the quality characteristics inter-operability and understandability is shown in Sample Checklist for Translation Memory Inspection.
A comprehensive inspection checklist should tackle every quality characteristic that is of importance for the particular test case. A great number of aspects that have to go into an inspection checklist can be derived from the user profile or system specification. Other quality characteristics and metrics, such as many of those for usability, are of general importance and are thus not the subject of any specification document. They are not application-specific and therefore, once elaborated, can be reused in many inspection checklists.
Each checklist must be adapted to the specific test type. Apart from inspection, checklists are needed for observations carried out in the framework of scenario tests (Hoge91a, A-III) and (Hoge92, A-4). Experience has shown that scenario checklists are best laid out as a table that has at least the following columns:
Sample Checklist for Translation Memory Inspection
|quality characteristic||metric||measurement technique||value|
|inter-operability||integration of program into text processing||possible?||Y/N|
| ||integration of other data resources (e.g. term bank)||possible?||Y/N|
| ||sharing of databases etc.||possible?||Y/N|
| ||how many databases available?|| ||number|
|understandability||sequence of windows/actions easy to understand?||understandable?||0 - 1 - 2 - 3 - 4|
| ||names of buttons/iconics/menus easy to understand? etc.||understandable?||0 - 1 - 2 - 3 - 4|
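A checklist of this kind maps naturally onto a small data structure. The sketch below is an assumption about how such a checklist could be held in a program, not part of any cited tool: the class `ChecklistItem`, the three value types ("boolean" for Y/N, "number", "rating" for 0-4) and the helper `open_items` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChecklistItem:
    characteristic: str          # e.g. "inter-operability"
    metric: str                  # what is inspected
    value_type: str              # assumed types: "boolean", "number" or "rating"
    value: Optional[int] = None  # None until the inspector fills it in

    def record(self, value):
        """Store a value after checking it against the item's value type."""
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if self.value_type == "boolean":
            valid = isinstance(value, bool)
        elif self.value_type == "number":
            valid = is_int and value >= 0
        elif self.value_type == "rating":
            valid = is_int and 0 <= value <= 4
        else:
            valid = False
        if not valid:
            raise ValueError(f"{value!r} is not a valid {self.value_type} value")
        self.value = value


def open_items(checklist):
    """Return the metrics that have not been rated yet."""
    return [item.metric for item in checklist if item.value is None]


checklist = [
    ChecklistItem("inter-operability", "integration of program into text processing", "boolean"),
    ChecklistItem("inter-operability", "how many databases available?", "number"),
    ChecklistItem("understandability", "sequence of windows/actions easy to understand?", "rating"),
]
checklist[0].record(True)
checklist[1].record(3)
print(open_items(checklist))
```

Validating each entry against its value type keeps boolean, numerical and rating values from being mixed up when the checklist is analysed later.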
Effective direct observation depends to a large extent on the suitability of the checklist. Thus the checklist needs to be well organised, providing the possibility to take up every item that relates to the quality characteristics of interest. At the same time, a scenario checklist needs to be flexible enough to follow unexpected user behaviour. Whereas it is fairly easy to fill in inspection checklists, effective checklisting in scenario tests is very difficult. This is mostly because the observer has to do two things at the same time, i.e. observe and take notes. Thus it is advisable to perform pilot observations with a draft checklist before actually entering a test if the observer has no checklisting experience and/or the appropriateness of the checklist has not been tested before.
Interviews are part of most scenario tests. In an interview, one or more interviewees are questioned by one or more interviewers. When aiming at the elicitation of very critical, personal data, it is advisable to perform personal interviews involving only one interviewee and one interviewer, whereas for more general data elicitation, a group of interviewees is likely to stimulate the readiness of the individual to provide detailed answers and thus to increase the amount of data.
When planning an interview, one has to decide when and how the interview should take place. Concerning the point in time at which interviews are typically performed, one may distinguish between pre- and post-testing interviews. Pre-testing interviews are performed in order to elicit the subjects' personal backgrounds, opinions and expectations concerning the system that is going to be tested (Moll88, 73) and (Crellin90, 330). The information gained by means of pre-testing interviews gives valuable hints when interpreting the scenario test results. Post-testing interviews are an important, if not necessary, part of each scenario test (Hoge91a, 22) and (Crellin90, 330). They are performed after the observational data, e.g. on video tapes or checklists, have been analysed. Each aspect that needs further clarification is taken up in the post-testing interview. When performed in conjunction with video observation or think-aloud, the behaviour and comments of the subjects in particular situations can be discussed with the subject and analysed jointly. A combination of pre- and post-testing interviews is particularly useful, since it allows an assessment of how the subjects' opinions changed during the testing exercise. Cf. (Moll88, 73), who reported that at the beginning of a test, attitudes towards the usefulness of help systems were quite positive, while at the end, after the help system had been used various times, they were much more negative.
There are various possibilities to perform interviews. A basic distinction which stems from knowledge engineering is between focused and structured interviews (Fulford90, 14). In a focused interview, the interviewee is prompted with a question related to his/her working environment, i.e. typical tasks, problems etc. and his/her general opinion towards the system under testing. The interviewee is thereafter given the opportunity to express him/herself freely while being interrupted as little as possible (Fulford89, 16) and (Crellin90, 330). ``The principal aim of the focused interview is to obtain a typology of objects and agents in the domain, to establish basic factual knowledge, and to achieve a breakdown of the problem'' (Fulford89, 17).
The structured interview is used for obtaining detailed information on specific topics which arose from testing. During the structured interview, the interviewer often follows prepared checklists, hands out debriefing questionnaires which discuss specific topics related to the software (cf. (Karat90, 535) and questionnaires), shows a series of storyboards to enable the interviewee to visualise possibilities of the screen layout (Fulford90, 14), or distributes multiple-choice tests eliciting, for instance, the understandability of help messages etc. (Moll88, 73).
Ideally, interviews are audio recorded, which eases later data analysis. The major advantage of interviewing lies in the fact that, unlike e.g. interactive on-line questionnaires and observations, it ``...does not interfere with the processes as they take place.'' (Crellin90, 330). Moreover, it allows information to be elicited on important aspects such as the organisational constraints of integrating the software into everyday work. Also, the post-testing interview is the only means to verify or falsify the interpretations made from data which were collected with the aid of other instruments such as observations or questionnaires. It is thus an important means to ease data analysis, since the probability of incorrect data interpretation is decreased. On the other hand, the major disadvantage of interviews is that the lack of anonymity in personal interviews may drive interviewees to suppress important information deliberately. In addition, on the part of the interviewer, bias, inexperience or fatigue may distort the data (Crellin90, 330), while on the part of the interviewee ``post-hoc rationalisation may occur and conceal evidence of the actual processes that took place'' (Crellin90, 330).
Despite all criticism of the subjectivity of interviews as an instrument for data collection, it has to be noted that, when performed in combination with other instruments such as, above all, observations and questionnaires, the role of interviews should not be underestimated.
Observations are the most important instrument in any kind of software test involving users. Observations can deliver results on all user-related quality characteristics such as understandability, learnability, operability, task adequacy, task-relevance, usability, comprehensibility, error-tolerance, consistency etc. as well as on the functionality and the efficiency of the software system (Oppermann88, 11).
One may distinguish between direct and indirect observation. When performing direct observation one or more evaluator(s) sit close to the subject, while watching and taking notes on prepared scenario checklists (Karat90, 330), (Hoge91a, 22), (Hoge92, 8) and (Hoge93b, 10). As pointed out before, the success of direct observation depends to a large extent on the appropriateness of the checklist and the experience of the evaluator. When interpreting the results of direct observations, one has to keep in mind that it is very difficult to observe users without intruding, which often alters the nature of the interaction with the system (Crellin90, 330). Thus direct observation should always be combined with post-testing interviews which allow the discussion of user behaviour. However, even if combined with checklists and interviews, direct observation will mainly provide results for those metrics that could be pre-defined, i.e. those aspects to which the observer paid attention and which could be noted on the checklist.
For indirect observation, either video recording or one-way mirrors are used. Though video recording is not as intrusive as direct observation, it still has a certain effect on the behaviour of the user. Thus, video recording is particularly useful in combination with other test instruments such as think-aloud and interviews. Think-aloud and video make it possible to reconstruct the user's mental model of the software (Moll88, 74). This is particularly interesting if there is a clash between user expectations and the actual performance of the system, something which is often the case for conceptually new software as in the NLP area. In the TWB project this experience was made with style checkers: before testing, the subjects showed an extremely negative attitude towards them, which was due to their expectation that the system would actually criticise their personal style. Not knowing what a computer is capable of doing, users are often disappointed if the functionality of the system is more limited than what they expected.
Post-testing interviews are also useful, particularly when they include the re-play of video segments that posed problems in the data analysis phase. Subjects are then asked to watch their behaviour in ``critical'' situations and comment on their behaviour and the problems encountered. Cf. (Moll88, 74), who call this ``Selbstbeobachtung aus der frischen Erinnerung'' (self-observation from fresh memory).
The major advantage of video recording is that it allows reviewing the data as often as necessary for thorough data analysis. However, viewing the tapes several times, identifying interesting segments that can provide information on those quality characteristics that are important for the particular case, reviewing these segments and capturing the details is a considerable amount of work (VainioLarsson90, 325). Thus, while video recording requires little time in the test preparation phase, the data analysis phase is n times longer than with direct observation (Crellin90, 332). Accordingly, the budget planned for video observation has to cover both additional equipment and high personnel costs.
One-way mirror observation is only used in well-equipped laboratories (Karat90, 353). It is the least obtrusive way of observing subjects. Similar to direct observation, one-way mirror observation has to be combined with a scenario checklist on which to note user behaviour. Concerning data analysis, the same is true as for direct observation, i.e. mostly only those aspects are captured to which the observer pays attention during observation.
Observations need to be performed in conjunction with other forms of data collection, since (i) important ``internal'' aspects are not available for data interpretation, and (ii) the detailed interaction with the system cannot be completely covered. Complemented by interviews or think-aloud as well as logfile recording, the quality of results increases considerably.
Think-aloud protocols are used in many empirical investigations. They are a means of qualitative data collection. The motivation behind using think-aloud protocols is to collect information on the users' own reasons for their behaviour (VainioLarsson90, 325), (Moll88, 74) and (Crellin90, 331). The data collected in think-aloud protocols need to be evaluated carefully, since thinking aloud presupposes that users are able to describe their actions, which is only true for users trained to verbalise their thoughts (VainioLarsson90, 325). Also, what users are able to verbalise represents only the conscious part of their thoughts and thus neglects important subconscious aspects (Honig82, 82). Another problem of applying think-aloud protocols is that thinking aloud may have a negative effect on the user behaviour as it is recorded by direct or indirect observation (Crellin90, 330). This is because it is even more intrusive than pure observation (Crellin90, 330), and, moreover, ``... many users have difficulty in acting and reflecting simultaneously.'' (VainioLarsson90, 325).
Due to the various problems related to think-aloud protocols, in testing practice think-aloud protocols are only used as a complementary method to ease data interpretation. As such they are valuable, as long as the related context and problems are considered in data interpretation.
Most automatic test instruments are developed for specific types of application, e.g. to perform benchmark tests for translation memory programs and the like. For details on NLP toolkits see (Galliers93, 129). The aim of this chapter, however, is not to describe individual toolkits but rather to provide the reader with some information on generally usable automatic test instruments.
Those automatic test instruments that can be used with different types of applications in user tests are mostly geared to support data collection. They elicit both qualitative and quantitative data and are useful supplements to manual test instruments in user-oriented software testing. They are additional data collection instruments which are in the first instance rich in data and do not intrude on the user's thoughts or activities (Neal85, 1052) and (Crellin90, 335). They document the actual user behaviour on the system, which, in the analysis phase, can be compared to what the user thinks he/she was doing (Crellin90, 331).
There are many different names that denote two major types of tool, i.e. (i) logging programs that time-stamp and record the user-interaction with the system into files that can be printed and analysed after test completion, cf. ``interaction logging'' in (VainioLarsson90, 325) and in (Karat90, 353), ``keystroke level model'' in (Oppermann88, 11) and in (Crellin90, 333), ``system usage log'' in (Crellin90, 333), and (ii) playback programs that record user-interaction and provide playback facilities for later analysis, cf. ``playback methodology'' in (Neal85, 1052), ``logfile recording'' in (Moll88, 73), ``interaction log'' in (Crellin90, 353). The major difference between the tools, therefore, does not lie in their aim but rather in the way they support the data analysis phase. Whereas the data of logging programs require a ``manual'' analysis of the interaction data, playback programs actually show the whole testing session on a second computer (Neal85, 1054).
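The distinction between the two tool types can be made concrete with a small sketch. It does not reproduce any of the cited tools; `InteractionLogger`, `load_events` and `playback` are hypothetical names, the JSON-lines file format is an assumption, and the clock is injected only so that the example produces fixed time stamps.

```python
import io
import json
import time


class InteractionLogger:
    """Time-stamps user events and writes them to a log (hypothetical design)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.events = []

    def record(self, kind, detail):
        """Store one keystroke or mouse event together with its time stamp."""
        self.events.append({"t": self._clock(), "kind": kind, "detail": detail})

    def dump(self, stream):
        """Write one JSON object per line so the log can be printed or parsed later."""
        for event in self.events:
            stream.write(json.dumps(event) + "\n")


def load_events(stream):
    """Read back a log written by InteractionLogger.dump."""
    return [json.loads(line) for line in stream if line.strip()]


def playback(events):
    """Yield (delay, event) pairs; a replay tool would wait `delay` seconds
    before re-issuing each event, preserving the original timing."""
    previous = None
    for event in events:
        delay = 0.0 if previous is None else event["t"] - previous
        previous = event["t"]
        yield delay, event


# Illustrative session with fixed time stamps instead of the real clock.
ticks = iter([0.0, 1.5, 4.0])
logger = InteractionLogger(clock=lambda: next(ticks))
logger.record("key", "F1")
logger.record("mouse", "click:Save")
logger.record("key", "Esc")

log_file = io.StringIO()  # stands in for a file on disk
logger.dump(log_file)
log_file.seek(0)
restored = load_events(log_file)
delays = [delay for delay, _ in playback(restored)]
print(delays)
```

In this sketch both tool types share the same recorded data; the logging variant stops at the written file, while the playback variant adds the timing information needed to re-run the session.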
Since they record non-verbalised operations, i.e. all keystrokes and mouse activities, including incorrect inputs (VainioLarsson90, 325), these tools provide data that yield useful information on quality characteristics related to the usability and functionality of the software. For instance, the frequency of use of a certain function within several testing sessions gives some hints on the task-relevance of the function (Moll88, 73), (Hoge93b, 10) and (VainioLarsson90, 325), while the occurrence of cumulative handling errors provides information on the understandability as well as on the learnability of the function (Moll88, 73) and (Hoge93b, 10). The suitability of the help function can be assessed from the number of cases in which a solution was found after help had been consulted (Moll88, 74).
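The three indicators mentioned above can be computed from a log along the following lines. This is a sketch under stated assumptions: the event format (a `function` name plus an `outcome` of "ok" or "error") is invented for illustration, and a help consultation is counted as successful when the next action is a non-help action that succeeds, which is only one possible operationalisation.

```python
from collections import Counter


def usage_frequency(events):
    """How often each function was used (hint at task-relevance)."""
    return Counter(e["function"] for e in events if e["function"] != "help")


def error_counts(events):
    """Handling errors per function (hint at understandability and learnability)."""
    return Counter(e["function"] for e in events if e["outcome"] == "error")


def help_success_rate(events):
    """Share of help consultations followed by a successful action.

    Assumption: "solution found" means the next event is a non-help action
    with outcome "ok". A help call at the very end counts as unsolved.
    """
    consultations = 0
    solved = 0
    for current, following in zip(events, events[1:]):
        if current["function"] == "help":
            consultations += 1
            if following["function"] != "help" and following["outcome"] == "ok":
                solved += 1
    if events and events[-1]["function"] == "help":
        consultations += 1
    return solved / consultations if consultations else None


# Illustrative log of one session (invented data).
events = [
    {"function": "search", "outcome": "error"},
    {"function": "help", "outcome": "ok"},
    {"function": "search", "outcome": "ok"},
    {"function": "import", "outcome": "error"},
    {"function": "help", "outcome": "ok"},
    {"function": "import", "outcome": "error"},
    {"function": "search", "outcome": "ok"},
]
print(usage_frequency(events))
print(error_counts(events))
print(help_success_rate(events))
```

Counts of this kind only hint at the underlying quality characteristic; as the text notes, they need to be interpreted together with data from the other instruments.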
Logging and playback programs are general data collection programs that are external to the software under testing, i.e. no changes are necessary from one application to another (Neal85, 1052). They can be used with actual product code or with prototypes of the user interface of a product under development (Neal85, 1052). The application of toolkits for logging programs which are available on the PC market generally requires intimate knowledge of the systems under testing, since the interfaces between the application and the logging program need to be specified (Crellin90, 331).