In the section discussing ISO 9126 and our extensions to it, we have already pointed out that the validity of measures is a critical issue: if the measures used are not valid, the evaluation as a whole is worthless. Where there has been controversy about past evaluations, it is often the validity of the measures, or of the methods used to obtain a measurement, that has been called into question. Given the importance of this point, we shall recapitulate some of the earlier discussion here.
Earlier, we distinguished two kinds of validity, internal and external. Let us look at these in turn.
We saw earlier that internal validity was achieved by making sure that what was measured reflected an appropriate attribute of the object to be evaluated. We can restate this here by saying that internal validity reflects the degree to which the measure directly represents the interest of the user, taking into account both the user himself and the context in which he works. An intuitively straightforward example of internal validity in the case of spelling checkers is offered by an attribute which takes as its value the language the spelling checker deals with. It is quite hard to imagine a spelling checker for Greek being evaluated as useful for someone who had to deal with Italian text.
In general, internal validity is rather more difficult to ensure than this example suggests. It relies on the judgement of experts (in our case, the judgement of those who design the evaluation) and can only be justified after the event, in the light of feedback from the customers of evaluations or from other interested outsiders.
External validity, as it was defined earlier, is achieved by demonstrating a correlation between a measure and some external criterion. It is rarely worked out formally (by, for example, actually calculating a coefficient of correlation) in the case of evaluation design, but is often used informally to justify the choice of a measure. Our earlier example of the size of the dictionary used by a spelling checker is a case in point. The size of the dictionary is of interest because, given our knowledge of how spelling checkers currently work, we believe that a small dictionary means that many false positives will be flagged, whereas a large dictionary means that (relatively) few false positives will be flagged: dictionary size correlates with signalling of false positives.
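To make concrete what working out external validity formally might involve, the following is a minimal sketch in Python of calculating a coefficient of correlation between dictionary size and the number of false positives flagged. The choice of the Pearson coefficient is our assumption, since the text does not name a particular coefficient, and the function name `pearson` and the figures in `dictionary_sizes` and `false_positives` are hypothetical, invented purely for illustration; they are not measurements of any real spelling checker.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL data, invented for illustration only: one entry per
# imaginary spelling checker, all run over the same imaginary test text.
dictionary_sizes = [20_000, 50_000, 100_000, 200_000, 400_000]  # entries
false_positives = [310, 190, 120, 70, 45]  # correct words wrongly flagged

r = pearson(dictionary_sizes, false_positives)
print(f"correlation between dictionary size and false positives: {r:.2f}")
```

With these invented figures the coefficient comes out negative, which is the direction the informal argument predicts: the larger the dictionary, the fewer the false positives. A coefficient of this kind captures only a linear relationship, so in a real evaluation design it would be one piece of evidence for the choice of measure rather than a proof of its validity.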