The test collection method of evaluation is long established in the information retrieval community. Under this method, a retrieval system is evaluated by having it index a fixed document corpus, then running a set of queries against that corpus. Assessors judge which documents are relevant to which queries, and the system is scored on the density and distribution of relevant documents it retrieves. Two systems are compared by running them on the same collection and contrasting their scores. Different queries, however, have radically different levels of difficulty: one query may have only a couple of hard-to-locate relevant documents in the corpus, while another might have hundreds of easily found ones. As a result, score variability is in general greater between queries than between systems, hindering the interpretation even of aggregate scores in isolation, and making scores incomparable between different collections. In this talk, we introduce score standardization as a method for adjusting for the variability of query difficulty. A set of reference systems is run against the collection, as is already the practice during collection formation, and the scores these systems achieve on each query are used to measure the difficulty and variability of that query. Scores achieved by new systems are then standardized against these reference scores, resulting in scores that are interpretable in themselves, and comparable even between different collections.

Biographical: William Webber is a Research Associate in the Department of Computer Science and Software Engineering at the University of Melbourne, Australia. He has recently completed his PhD thesis, "Measurement in Information Retrieval Evaluation", under the supervision of Professors Alistair Moffat and Justin Zobel.
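The per-query standardization described above can be sketched as a z-score: the new system's score on a query is centred and scaled by the mean and standard deviation of the reference systems' scores on that same query. This is an illustrative sketch only; the function name and the example scores are assumptions, not taken from the talk.

```python
from statistics import mean, stdev

def standardize(score, reference_scores):
    """Standardize one system's score on a single query against the
    scores a set of reference systems achieved on that query.

    A positive result means the system beat the reference mean on this
    query; the unit is reference standard deviations, so standardized
    scores are comparable across queries of very different difficulty.
    """
    mu = mean(reference_scores)       # query difficulty (location)
    sigma = stdev(reference_scores)   # query variability (scale)
    return (score - mu) / sigma

# Hypothetical average-precision scores of five reference systems
# on one query of the collection.
refs = [0.20, 0.25, 0.30, 0.35, 0.40]

# A new system scoring 0.45 on this query sits well above the
# reference mean of 0.30, so its standardized score is positive.
z = standardize(0.45, refs)
```

An "easy" query (high reference mean) and a "hard" query (low reference mean) thus no longer dominate or vanish in an aggregate: each query contributes on the same standardized scale.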