
General aspects of evaluation procedure

ISO on evaluation procedures

Source: ISO 9126: 1991, section 5.3. This section largely repeats the summary of the ISO report already given in chapter 2.

The evaluation procedure consists of three stages and it may be applied in every appropriate phase of the life-cycle for each component of the software product:

Quality requirement definition.

'The purpose of the initial stage is to specify requirements in terms of quality characteristics and possible subcharacteristics. Requirements express the demand of the environment for the software product under consideration, and must be defined prior to the development. As a software product is decomposed into major components, the requirements derived from the overall product may differ for the different components.'

Evaluation preparation.

'The purpose of the second stage is to prepare the basis for evaluation.' This stage consists of three components: quality metrics selection, rating levels definition, and assessment criteria definition.

Evaluation procedure proper.

'The last step of the Evaluation Process Model is refined into three steps, namely measurement, rating and assessment.'
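To make the three steps concrete, the following minimal Python sketch walks one measured value through measurement, rating and assessment. The metric (segment recall), the thresholds and the rating labels are illustrative assumptions, not values prescribed by ISO 9126.

    # Sketch of the measurement -> rating -> assessment steps.
    # Metric, thresholds and rating labels are illustrative only.

    def measure_segment_recall(found: int, expected: int) -> float:
        """Measurement: compute a raw value on a metric scale."""
        return found / expected if expected else 0.0

    def rate(value: float) -> str:
        """Rating: map the measured value onto discrete rating levels
        (hypothetical thresholds)."""
        if value >= 0.9:
            return "excellent"
        if value >= 0.7:
            return "good"
        if value >= 0.5:
            return "fair"
        return "poor"

    def assess(rating: str, required: str = "good") -> bool:
        """Assessment: decide acceptability against a stated criterion."""
        levels = ["poor", "fair", "good", "excellent"]
        return levels.index(rating) >= levels.index(required)

    value = measure_segment_recall(found=42, expected=50)   # 0.84
    print(rate(value))                                      # "good"
    print(assess(rate(value)))                              # True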

Desiderata for testing methods

Tests for evaluation should, wherever possible, have the following properties:

Test types

Attributes can be typed according to the kind of tests needed to establish their values for given objects of evaluation, as chapter 2 already described.

In this section, we distinguish three main types of test. Test types differ with respect to who does the testing (an evaluator or a translator), what data and tools are needed (the program only, or the program plus data; laboratory data or real data), and what the outputs of the test are like (quantitative or qualitative, objective or subjective-impressionistic).

Checklisting of features (= specification/inspection).

A featurization (see section Towards formalisation and automation) is a hierarchical feature structure describing an object: its components, functions, attributes and values. The featurization can be based on the manufacturer's data or verified by the evaluator. Checklisting is done by the evaluator and does not require complicated testing procedures. It is the method for measuring boolean attributes (presence or absence of a feature) and other formal-valued attributes.
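As an illustration, a featurization can be represented directly as a nested attribute-value structure, with boolean attributes recording feature presence. The following Python sketch is a toy example; the component and attribute names are invented, not a prescribed checklist.

    # Sketch of a featurization: a hierarchical structure of components,
    # attributes and values. All names here are invented examples.

    featurization = {
        "editor": {
            "spell_checking": True,          # boolean: feature present
            "supported_formats": ["RTF", "SGML"],
        },
        "translation_memory": {
            "fuzzy_matching": True,
            "max_database_size_mb": 500,     # formal-valued attribute
        },
    }

    def check_feature(feat: dict, path: list[str]):
        """Checklisting: look up an attribute's value along a path of
        component names, e.g. ["translation_memory", "fuzzy_matching"]."""
        node = feat
        for key in path:
            node = node[key]
        return node

    print(check_feature(featurization,
                        ["translation_memory", "fuzzy_matching"]))  # True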

Scenario test (= users test the complete system in a realistic environment).

This involves putting the system to its intended use by its intended type of user, and recording the quality and efficiency of the work as well as user impressions. This is the natural point at which impressionistic evaluations are collected, and where this type of evaluation is most meaningful and reliable: first impressions can be deceptive, and the real values emerge only in continued use. The testing is done by the translator and involves real data; the results can be qualitative (questionnaires, impressionistic keyword evaluations such as 'easy to use' or 'slow', free-form reports) or quantitative (statistics).
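For illustration, one scenario test record might combine impressionistic keywords, a free-form report and simple work statistics. The following Python sketch uses invented field names; it is not a standard record format.

    # Sketch of a scenario test record mixing qualitative and
    # quantitative results. Field names and values are invented.

    from dataclasses import dataclass, field

    @dataclass
    class ScenarioTestRecord:
        translator: str
        keywords: list[str] = field(default_factory=list)  # impressions
        free_form_report: str = ""                         # qualitative
        words_translated: int = 0                          # quantitative
        hours_worked: float = 0.0

        def words_per_hour(self) -> float:
            if not self.hours_worked:
                return 0.0
            return self.words_translated / self.hours_worked

    record = ScenarioTestRecord(
        translator="T1",
        keywords=["easy to use", "slow"],
        free_form_report="Lookup was convenient but saving was sluggish.",
        words_translated=2400,
        hours_worked=6.0,
    )
    print(record.words_per_hour())   # 400.0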

Höge et al. additionally distinguish comparative testing (comparing different products). Direct comparison of products may be the only way to obtain useful results for attributes whose values are defined only on a comparison scale. On the other hand, it may be difficult to obtain reliable results in comparative testing, in particular to guarantee that all the systems being compared are given equal attention.

Benchmark testing (= systematic testing using test tools and materials).

A benchmark is a regimented test measuring a metric attribute. A benchmark should help determine objectively and reliably whether a function of a component achieves its stated purpose.

Three ideas are central to benchmarking:

A benchmark is specified by stating:

Benchmarking translation memories involves quantitative tests, based on specifically prepared materials, of specific central functions on the checklist. For instance: train a translation memory with a given reference text, run it against a source text, and note the output: what and how many segments are found. This is done by the evaluator, requires controlled data, and yields quantitative outputs.
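A minimal sketch of such a benchmark in Python: 'training' a toy translation memory here simply means storing the reference segments, and the run counts how many source segments are retrieved. Sentence-boundary segmentation and exact matching are simplifying assumptions; a real benchmark would also specify fuzzy-match thresholds.

    # Sketch of a translation-memory benchmark: store reference segments,
    # run against a source text, count how many segments are found.
    # Naive segmentation and exact matching are simplifying assumptions.

    import re

    def segment(text: str) -> list[str]:
        """Naive sentence segmentation on ., ! and ?."""
        return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

    def train_memory(reference_text: str) -> set[str]:
        """'Training' here just means storing the reference segments."""
        return set(segment(reference_text))

    def run_benchmark(memory: set[str], source_text: str) -> tuple[int, int]:
        """Return (segments found in memory, total segments in source)."""
        segments = segment(source_text)
        found = sum(1 for s in segments if s in memory)
        return found, len(segments)

    memory = train_memory("The cat sat. The dog ran. It rained.")
    found, total = run_benchmark(memory, "The dog ran. The sun shone.")
    print(f"{found}/{total} segments found")   # 1/2 segments found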

An evaluation procedure guide should therefore always include a list of the critical functions to evaluate, measurable dimensions of those functions, ranges of acceptable values on those dimensions, and methods (and possibly tools) for measuring them. This covers what Höge et al. call task-oriented systematic testing.

It seems that benchmarking is best suited for testing the quality characteristics of functionality/accuracy and efficiency.

