Source: ISO 9126:1991, section 5.3. This section repeats to a large
extent the summary of the ISO report already present in chapter 2.
The evaluation procedure consists of three stages and it may be
applied in every appropriate phase of the life-cycle for each
component of the software product:
'The purpose of the initial stage is to specify requirements in terms
of quality characteristics and possible subcharacteristics.
Requirements express the demand of the environment for the software
product under consideration, and must be defined prior to the
development. As a software product is decomposed into major
components, the requirements derived from the overall product may
differ for the different components.'
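To make this concrete, the following minimal sketch (in Python, with invented component names and target levels) shows requirements stated at the product level and refined per component:

    # Stage 1 sketch: quality requirements per component.
    # All component names and demanded levels are invented for illustration.
    requirements = {
        "overall product": {"functionality": "high", "usability": "high"},
        "translation memory": {"functionality/accuracy": "high", "efficiency": "medium"},
        "terminology lookup": {"usability": "high", "efficiency": "high"},
    }
    for component, demanded in requirements.items():
        print(component, "->", demanded)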
'The purpose of the second stage is to prepare the basis for
evaluation.' This stage consists of three components:
- Quality metrics selection: 'The manner in which quality
characteristics have been defined does not allow their direct
measurement. The need exists to establish metrics that correlate to
the characteristics of the software product. Every quantifiable
feature of software and every quantifiable interaction of software
with its environment that correlates with a characteristic can be
established as a metric. [ ... ] Metrics can differ depending on the
environment and the phases of the development process in which they
are used. Metrics used in the development process should be
correlated to the respective user metrics, because the metrics from
the user's view are crucial.'
- Rating levels definition: 'Quantifiable features can be measured
quantitatively using quality metrics.' The result, the measured
value, must be interpreted as a rated value, i.e. 'divided into
ranges corresponding to the different degrees of satisfaction of the
requirements. Since quality refers to given needs, no general levels
for rating are possible. They must be defined for each specific
evaluation.'
- Assessment criteria definition: 'To assess the quality of the
product, the results of the evaluation of the different
characteristics must be summarized. The evaluator has to prepare a
procedure for this, using, for instance, decision tables or weighted
averages. The procedure usually will include other aspects such as
time and cost that contribute to the assessment of quality of a
software product in a particular environment.'
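The three components above can be illustrated with a small sketch. Everything in it is hypothetical: the metrics, the response-time thresholds used as rating levels, and the decision table used as an assessment criterion are invented, since the standard itself prescribes no particular values.

    # Sketch of the second stage; all names and numbers are hypothetical.

    # Quality metrics selection: each characteristic gets a measurable proxy.
    metrics = {
        "efficiency": "mean response time in seconds",
        "functionality/accuracy": "fraction of correct outputs",
    }

    # Rating levels definition: measured values divided into ranges that
    # correspond to degrees of satisfaction of the requirements.
    def rate_response_time(seconds):
        if seconds <= 1.0:
            return "excellent"
        if seconds <= 3.0:
            return "acceptable"
        return "unacceptable"

    # Assessment criteria definition: here a small decision table combining
    # two rated levels into an overall verdict; a weighted average would
    # serve equally well, as the standard notes.
    DECISION_TABLE = {
        ("excellent", "excellent"): "excellent",
        ("excellent", "acceptable"): "good",
        ("acceptable", "excellent"): "good",
        ("acceptable", "acceptable"): "acceptable",
    }

    def assess(efficiency_level, accuracy_level):
        return DECISION_TABLE.get((efficiency_level, accuracy_level), "unacceptable")

    print(assess(rate_response_time(0.8), "acceptable"))  # -> good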
'The last step of the Evaluation Process Model is refined into three
steps, namely measurement, rating and assessment.'
- Measurement: 'For measurement, the selected metrics are applied
to the software product. The result is values on the scales of the
metrics.'
- Rating: 'In the rating step, the rating level is determined for
a measured value [ ... ]'
- Assessment: 'Assessment is the final step of the software
evaluation process where a set of rated levels are summarized. The
result is a statement of the quality of the software product. Then
the summarized quality is compared with the other aspects such as
time and cost. Finally managerial decisions will be made based on
the managerial criteria. The result is a managerial decision on the
acceptance or rejection, or on the release or no-release, of the
software product.'
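A minimal end-to-end sketch of the three steps follows; the measured values, the weights used to summarize the rated levels, and the acceptance threshold standing in for a managerial criterion are all invented.

    # The three steps end to end, on invented numbers.
    measured = {"efficiency": 2.1, "accuracy": 0.97}              # 1. measurement

    def rating(characteristic, value):                            # 2. rating
        if characteristic == "efficiency":                        # seconds, lower is better
            return 2 if value <= 1.0 else 1 if value <= 3.0 else 0
        return 2 if value >= 0.95 else 1 if value >= 0.80 else 0  # fraction correct

    weights = {"efficiency": 1.0, "accuracy": 2.0}
    score = sum(rating(c, v) * weights[c] for c, v in measured.items())
    score /= sum(weights.values())

    # 3. assessment: compare summarized quality against a managerial criterion.
    ACCEPTANCE_THRESHOLD = 1.5                                    # hypothetical
    print("accept" if score >= ACCEPTANCE_THRESHOLD else "reject", round(score, 2))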
Tests for evaluation should, wherever possible, have the following
properties:
- reliable: i.e. stable under repetition and under
irrelevant changes of the context of the measurement, such as the
person who applies it. There are well-known methods to establish
reliability (e.g. split-half; a sketch follows this list).
- valid: i.e. the measurement values obtained inform us about
the actual utility of the object of evaluation. As an example, one
may try to infer, for some battery of tests, what the measurement
results imply for the productivity of end users. In practice,
establishing some correlation between measurement and productivity
is already a good aim.
- efficiently applicable; in particular, it is desirable that
the measurements can be taken without involving users directly.
Users do not want to be bothered by performing evaluations; they
want others to do it for them. Users play a role in the
validation procedure, precisely with the intention of eliminating
them from later evaluations. However, validation will probably
always remain somewhat problematic, so evaluations will probably
always involve some degree of user activity.
- producing values that are
  - formal enough to serve as a basis for comparison amongst
  alternative members of the class of objects of evaluation under
  consideration
  - mappable to utility: e.g., measuring the weight of some
  object of evaluation should only happen if it is clear how weight
  relates to utility; this is already a kind of a priori validity
  consideration
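The first two properties can be made concrete in a short sketch. The split-half computation with Spearman-Brown correction is the standard method; the score matrix and the use of end-user productivity as the validity criterion are invented for illustration.

    # scores[i][j] is the score of product i on test item j (invented data).
    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    def split_half_reliability(scores):
        # Correlate the two half-test scores (odd items vs. even items),
        # then correct for test length with the Spearman-Brown formula.
        odd = [sum(row[0::2]) for row in scores]
        even = [sum(row[1::2]) for row in scores]
        r = pearson(odd, even)
        return 2 * r / (1 + r)

    def validity(total_scores, productivity):
        # Validity as correlation between test scores and an external
        # criterion, e.g. end-user productivity in words per hour.
        return pearson(total_scores, productivity)

    scores = [[1, 1, 0, 1, 1, 0],
              [1, 0, 0, 1, 0, 0],
              [1, 1, 1, 1, 1, 1],
              [0, 0, 0, 1, 0, 0]]
    print(round(split_half_reliability(scores), 2))  # -> 0.97 on these data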
Attributes can be typed according to the kind of tests needed to
establish their values for given objects of evaluation. Chapter 2
already described this.
In this section, we distinguish three main types of test. Test types
differ with respect to who does the testing (an evaluator or a
translator), what data and tools are needed (program only or program
and data, laboratory data or real data), and what the outputs of the
test are like (quantitative or qualitative, objective or subjective).
The first type is checklisting. A featurization (see section Towards
formalisation and automation) is a hierarchical feature
structure describing an object: its components, functions, attributes
and values. The featurization can be based on the manufacturer's data
or checked out by the evaluator. Checklisting is done by the
evaluator and does not require complicated testing procedures.
Checklisting is the method of measuring boolean (presence/absence of a
feature) and other formal-valued attributes.
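A minimal sketch of checklisting against a featurization follows; the feature names and values are invented.

    # Checklisting: boolean attributes read off a featurization.
    featurization = {
        "import formats": {"RTF": True, "SGML": True, "plain text": True},
        "fuzzy matching": True,
        "terminology database": False,
    }

    checklist = ["fuzzy matching", "terminology database"]
    for feature in checklist:
        present = bool(featurization.get(feature))
        print(f"{feature}: {'present' if present else 'absent'}")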
The second type of test involves putting the system into its intended
use by its intended type of user, and recording the quality and
efficiency of the work as well as user impressions. This is the
natural point at which impressionistic evaluations are collected.
(This is where this type of evaluation is most meaningful and
reliable: first impressions can be deceptive, and the real values only
emerge in continued use.) The testing is done by the translator; it
involves real data, and the results can be qualitative, including
questionnaires, impressionistic keyword evaluations ('easy to use',
'slow'), and free-form reports.
Höge et al. also distinguish comparative testing (comparing different
products). Direct comparison of products may be the only way to obtain
useful results for attributes whose values are only defined on a
comparison scale. On the other hand, it may be difficult to obtain
reliable results in comparative testing, in particular to guarantee
that all the systems to be compared are given equal attention.
The third type is the benchmark: a regimented test measuring a metric
attribute. A benchmark should help determine objectively and reliably
whether a function of a component achieves a given stated purpose.
Three ideas are central to benchmarking:
- benchmarking is first of all standardization: a benchmark
should be applicable to a class of objects and yield results of a
sufficiently formal nature to be comparable.
- a benchmark has to be valid with respect to users' interests.
- a further desideratum for a benchmark is that it can be applied
efficiently; ideally, it should not involve end users at all, but be
applied by specialized benchmarkers and to a large degree be
automated.
A benchmark is specified by specifying:
- the attribute(s) it measures
- the tools and data needed to run the benchmark
- the requirements on time, manpower and experience
- the procedure to follow to run the benchmark
- the interpretation of the results of the benchmark
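Such a specification can be captured in a simple record type mirroring the five items above; the translation-memory instance below is hypothetical.

    from dataclasses import dataclass

    # A benchmark specification mirroring the five items above.
    @dataclass
    class BenchmarkSpec:
        attributes: list        # the attribute(s) the benchmark measures
        tools_and_data: list    # what is needed to run it
        requirements: str       # time, manpower and experience
        procedure: str          # how to run the benchmark
        interpretation: str     # how to read its results

    # A hypothetical instance for a translation-memory retrieval benchmark.
    tm_retrieval = BenchmarkSpec(
        attributes=["segment retrieval rate"],
        tools_and_data=["translation memory system", "reference text",
                        "source text"],
        requirements="one evaluator, a few hours, basic TM experience",
        procedure="train the TM on the reference text, run it on the source "
                  "text, count the segments found",
        interpretation="higher retrieval rate is better; map onto rating levels",
    )
    print(tm_retrieval.attributes)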
Benchmarking translation memories involves quantitative tests, based
on specifically prepared materials, of specific, central functions on
the checklist. For instance, train a translation memory with a given
reference text, run the translation memory against a source text, and
note the output: what and how many segments are found. This is done by
the evaluator, it requires controlled data, and the outputs are
quantitative.
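A toy version of such a test follows, using approximate string matching (difflib) as a stand-in for a real translation memory's retrieval; the segments and the 0.75 similarity threshold are invented.

    import difflib

    # "Train" a memory on reference segments, then count how many source
    # segments retrieve a sufficiently similar stored segment.
    reference_segments = [
        "The printer is out of paper.",
        "Press the green button to start.",
        "Switch off the device before cleaning.",
    ]
    source_segments = [
        "The printer is out of paper.",      # exact match expected
        "Press the red button to start.",    # fuzzy match expected
        "Store the manual in a dry place.",  # no match expected
    ]

    def best_match(segment, memory):
        return max(difflib.SequenceMatcher(None, segment, m).ratio()
                   for m in memory)

    hits = sum(1 for s in source_segments
               if best_match(s, reference_segments) >= 0.75)
    print(f"{hits}/{len(source_segments)} segments retrieved")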
It follows that an evaluation procedure guide should always include a
list of the critical functions to evaluate, measurable dimensions for
each of them, ranges of acceptable values on those dimensions, and
methods, possibly also tools, to measure them (a small sketch of such
a list follows). This covers what Höge et al. call task-oriented
systematic testing.
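A fragment of what such a list might look like, with hypothetical functions, dimensions and acceptable ranges:

    # Critical functions, one measurable dimension each, and a range of
    # acceptable values on that dimension. All entries are hypothetical.
    critical_functions = [
        ("segment retrieval", "retrieval rate on benchmark text", (0.80, 1.00)),
        ("fuzzy matching", "precision at 75% similarity", (0.90, 1.00)),
        ("import/export", "number of formats handled without loss", (3, None)),
    ]
    for name, dimension, (low, high) in critical_functions:
        upper = high if high is not None else "open-ended"
        print(f"{name}: {dimension}, acceptable from {low} to {upper}")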
It seems that benchmarking is best suited for testing the quality
characteristics of functionality/accuracy and efficiency.