Black box testing implies that both the selection of test data and the interpretation of test results are performed on the basis of the functional properties of a piece of software. Black box testing should not be performed by the author of the program (Thaller94, 92), who knows too much about the program internals. In newer testing approaches, software systems are given to a third, external party for black box testing after the internal glass box testing exercises have been successfully completed.
Though centred around knowledge of the user requirements, black box tests do not necessarily involve the participation of users. Among the most important black box tests that do not involve users are functionality testing, volume tests, stress tests, recovery testing, and benchmarks (Thaller94, 138-149). Additionally, there are two types of black box tests that do involve users, i.e. field and laboratory tests (Karat90, 352) and (Crellin90, 330). In the following, the most important aspects of these black box tests will be described briefly.
The so-called ``functionality testing'' is central to most testing exercises. Its primary objective is to assess whether the program does what it is supposed to do, i.e. what is specified in the requirements. There are different approaches to functionality testing. One is to test each program feature or function in sequence (Musa87, 521). The other is to test module by module, i.e. to test each function at the point where it is first called (Thaller94, 138).
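For illustration, a functionality test of the feature-by-feature kind might be sketched as follows; the function under test and its specified behaviour are invented for the example and are not taken from the literature cited above:

```python
# Hypothetical function under test; its "requirement" is given in the docstring.
def truncate(text, limit):
    """Specified behaviour: return at most `limit` characters of `text`."""
    return text[:limit]

def test_truncate():
    # Each assertion checks one clause of the (hypothetical) specification,
    # using only the external behaviour of the function (black box view).
    assert truncate("hello", 3) == "hel"   # text longer than limit is cut
    assert truncate("hi", 10) == "hi"      # text shorter than limit is kept
    assert truncate("", 5) == ""           # empty input stays empty

test_truncate()
```

Note that the test refers only to the specified input/output behaviour, never to the internal implementation, which is what distinguishes it from a glass box test.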
The objective of volume tests is to find the limitations of the software by processing huge amounts of data (Thaller94, 139). A volume test can uncover problems related to the efficiency of a system, e.g. incorrect buffer sizes or excessive memory consumption, or it may simply show that an error message is needed to tell the user that the system cannot process the given amount of data.
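The search for such a limit can be sketched schematically as follows; `process_batch` is a stand-in for the real system function, with an artificial capacity built in so that the example is self-contained:

```python
# Stand-in for the system under test: fails beyond a fixed (artificial) capacity.
def process_batch(records, capacity=10_000):
    if len(records) > capacity:
        raise MemoryError("buffer too small for %d records" % len(records))
    return len(records)

def find_volume_limit(sizes):
    """Feed increasingly large data sets and return the largest size
    the system processed without error."""
    largest_ok = 0
    for size in sorted(sizes):
        try:
            process_batch([0] * size)
            largest_ok = size
        except MemoryError:
            break  # the limitation the volume test was looking for
    return largest_ok

limit = find_volume_limit([1_000, 5_000, 10_000, 50_000])
```

In practice the interesting outcome is not the number itself but whether the failure mode at the limit is acceptable, e.g. a clear error message rather than a crash.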
During a stress test, the system has to process a huge amount of data or perform many function calls within a short period of time. A typical example could be to perform the same function from all workstations connected in a LAN within a short period of time (e.g. sending e-mails, or, in the NLP area, to modify a term bank via different terminals simultaneously).
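The simultaneous-terminals scenario can be imitated in miniature with concurrent threads; the term bank and its locking discipline below are purely illustrative, not taken from any of the systems discussed:

```python
import threading

# A shared "term bank" modified from many simulated terminals at once.
term_bank = {}
lock = threading.Lock()

def add_term(term, definition):
    # The lock stands in for the system's own concurrency control.
    with lock:
        term_bank[term] = definition

# 100 "terminals" write within a short period of time.
threads = [
    threading.Thread(target=add_term, args=("term%d" % i, "def%d" % i))
    for i in range(100)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(term_bank) == 100  # no updates lost under concurrent load
```

A stress test of this kind passes only if no updates are lost or corrupted when the load is applied in a burst rather than sequentially.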
The aim of recovery testing is to determine to what extent data can be recovered after a system breakdown. Does the system provide means to recover all of the data, or only part of it? How much can be recovered, and how? Is the recovered data still correct and consistent? Recovery testing is particularly important for software that must meet high reliability standards.
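A recovery test can be sketched as follows; the write-ahead journal and the torn final write are a deliberately simple illustration of one recovery mechanism, not a description of any particular system:

```python
import json, os, tempfile

def append_record(log_path, record):
    # Each update is journalled as one JSON line before taking effect.
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

def recover(log_path):
    """Rebuild the state from the journal; skip a half-written final line."""
    state = {}
    with open(log_path) as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                break  # torn write at crash time: discard and stop
            state[rec["key"]] = rec["value"]
    return state

log = os.path.join(tempfile.mkdtemp(), "journal.log")
append_record(log, {"key": "a", "value": 1})
append_record(log, {"key": "b", "value": 2})
with open(log, "a") as f:
    f.write('{"key": "c", "val')  # simulate a crash in mid-write

recovered = recover(log)
assert recovered == {"a": 1, "b": 2}  # complete records survive intact
```

The test answers exactly the questions posed above: how much is recovered (everything up to the crash) and whether the recovered data is still consistent.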
Benchmark tests are concerned with the testing of program efficiency. Since the efficiency of a piece of software strongly depends on the hardware environment, benchmark tests always consider the software/hardware combination (Thaller94, 147). Whereas for most software engineers benchmark tests are concerned with the quantitative measurement of specific operations (Thaller94, 146), some also regard user tests that compare the efficiency of different software systems as benchmark tests (Lewis90, 337-343). In the context of this document, however, benchmark tests denote only measurements that are independent of personal variables.
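Such a quantitative measurement of a specific operation can be sketched with Python's standard `timeit` module; the sorting workload is an arbitrary placeholder for the operation being benchmarked:

```python
import timeit

def workload():
    # Placeholder for the specific operation under measurement.
    sorted(range(1000, 0, -1))

# Taking the best of several repetitions reduces interference from
# other processes on the same machine.
elapsed = min(timeit.repeat(workload, number=100, repeat=3))
print("100 runs of the workload took %.4f s (best of 3)" % elapsed)
```

Because the measurement involves no user interaction, it is independent of personal variables in the sense used above; comparability across systems still requires reporting the hardware on which it was run.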
For tests involving users, methodological considerations are rare in SE literature. Rather, one may find practical test reports that distinguish roughly between field and laboratory tests (Karat90), (Crellin90) and (Moll88). In the following only a rough description of field and laboratory tests will be given. For details see Scenario Tests.
In field tests users are observed while using the software system at their normal working place. Apart from general usability-related aspects, field tests are particularly useful for assessing the interoperability of the software system, i.e. how well the technical integration of the system works. Moreover, field tests are the only real means of elucidating problems of the organisational integration of the software system into existing procedures. Particularly in the NLP environment this problem has frequently been underestimated. A typical example of the organisational problems of introducing a translation memory is the language service of a large automobile manufacturer. There, the major implementation obstacle is not the technical environment, but the fact that many clients still submit their orders as print-outs, that neither source texts nor target texts are properly organised and stored, and, last but not least, that individual translators are not particularly motivated to change their working habits.
Laboratory tests are mostly performed to assess the general usability of the system. Because of the high cost of laboratory equipment, such tests are mostly performed only at big software houses such as IBM or Microsoft. Since laboratory tests provide testers with many technical possibilities, data collection and analysis are easier than in field tests.
To conclude, apart from the analytical methods of glass and black box testing described above, there are further constructive means of guaranteeing high-quality software end products. Among the most important constructive means are the use of object-oriented programming tools, the integration of CASE tools, rapid prototyping, and, last but not least, the involvement of users in both software development and testing procedures (Thaller93, 157).
The above survey of glass and black box testing methods has shown that although considerable work exists on development-oriented testing practices, there is much less work to date on user-oriented evaluation methods. We hope that this report will go some way towards filling that lacuna.