Validation ensures that tests measure properties that actually affect utility; for translators' aids, the important validation criteria are end-user productivity and satisfaction. The customer may weigh these two criteria differently from the end user.

A well-established way to validate a test is to rank systems by their test results, rank them again by their performance in actual practice, and compare the two rankings. This is slow and expensive. The comparison may also be unreliable, both when the same subject tests the different systems and when different subjects are used, though for different reasons (for instance, learning effects in the former case and individual differences between subjects in the latter). In addition, the effects of the different attribute values and object components on total efficiency are hard to separate.
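One common way to quantify how well two such rankings agree is Spearman's rank correlation coefficient. The following is only a minimal sketch: the choice of Spearman's coefficient, the system names, and the rank values are all illustrative assumptions, not taken from any actual evaluation, and the formula used is the simple one that assumes no tied ranks.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same systems.

    Both arguments map a system name to its rank (1 = best).
    Uses the no-ties formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    """
    if rank_a.keys() != rank_b.keys():
        raise ValueError("both rankings must cover the same systems")
    n = len(rank_a)
    if n < 2:
        raise ValueError("at least two systems are needed")
    # Sum of squared rank differences over all systems.
    d_squared = sum((rank_a[s] - rank_b[s]) ** 2 for s in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rankings: one from the test, one from observed practice.
test_ranking = {"system_a": 1, "system_b": 2, "system_c": 3, "system_d": 4}
practice_ranking = {"system_a": 2, "system_b": 1, "system_c": 3, "system_d": 4}

rho = spearman_rho(test_ranking, practice_ranking)
print(f"rank agreement (Spearman's rho): {rho:.2f}")
```

A value near 1 would suggest that the test orders systems much as practice does; a value near 0 or below would cast doubt on the test's validity. With only a handful of systems the coefficient is a rough indicator at best, and it does not address the reliability problems noted above.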

There should be some more discriminating way to calibrate each component, perhaps by tracing the behavior of the subjects, i.e. their utilization of each component: number of terms accepted from the term bank, number of terms added to it, number of translations obtained from the translation memory and saved into it, etc.
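Such tracing could be as simple as counting logged usage events per subject and per component. The sketch below assumes a hypothetical log of (subject, event) pairs; the event names and the log format are invented for illustration and do not correspond to any particular tool.

```python
from collections import Counter

# Hypothetical event names, one per kind of component use mentioned above.
EVENT_TYPES = (
    "term_accepted",     # term taken over from the term bank
    "term_added",        # term added to the term bank
    "tm_match_used",     # translation obtained from the translation memory
    "tm_segment_saved",  # translation saved into the translation memory
)


def tally_events(events):
    """Count how often each subject triggered each kind of event."""
    counts = {}
    for subject, event in events:
        if event not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {event}")
        counts.setdefault(subject, Counter())[event] += 1
    return counts


# Invented example log, not real data.
log = [
    ("subject_1", "term_accepted"),
    ("subject_1", "term_accepted"),
    ("subject_1", "term_added"),
    ("subject_1", "tm_match_used"),
    ("subject_2", "tm_segment_saved"),
    ("subject_2", "tm_match_used"),
    ("subject_2", "tm_match_used"),
]

usage = tally_events(log)
for subject, counter in sorted(usage.items()):
    print(subject, dict(counter))
```

Counts of this kind show only how much each component was used, not whether its use improved productivity or satisfaction; they would have to be related to outcome measures before any conclusion about a component's contribution could be drawn.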

Validation studies are not always of vital importance: sometimes, the `internal validity' of an attribute is clear from the start, e.g. gas consumption of cars. For many software tools, speed seems to be of uncontroversial utility.

On the other hand, it is not really known whether, for example, a translation memory actually enhances end-user productivity or satisfaction at all. In this respect, NLP software is in a somewhat different position from the kinds of objects whose evaluations are found in consumer reports.