Manual delineations by experts are often used as reference standards for validating segmentation algorithms, although it is well known that they always show some degree of variability. Our goal is to estimate the effect of relying on only a limited number of expert segmentations. Given ten manual delineations of 13 liver tumors, we analyzed the volume error that results from using randomly selected subsets of the ten segmentations instead of the complete set.

We found that when using just one segmentation, the expected error was 17%, with a maximum of 35%. It is therefore questionable whether a comparison with a single reference allows a reliable validation. When three segmentations were chosen, the error was halved, so this might be a reasonable compromise between the accuracy and the feasibility of evaluation studies.
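
The subset analysis described above can be illustrated with a minimal sketch. It assumes that the error of a subset is the relative deviation of its mean volume from the mean volume of all available segmentations; the exact error definition is not stated here, so this is one plausible reading rather than the method actually used. The function names and the volumes are hypothetical and serve only as an illustration.

```python
from itertools import combinations
from statistics import mean


def subset_volume_errors(volumes, k):
    """Relative volume error of every size-k subset against the full set.

    Assumption: the reference volume is the mean over all segmentations,
    and a subset is represented by the mean volume of its members.
    """
    reference = mean(volumes)
    return [
        abs(mean(subset) - reference) / reference
        for subset in combinations(volumes, k)
    ]


def summarize(volumes, k):
    """Return (expected error, maximum error) over all size-k subsets."""
    errors = subset_volume_errors(volumes, k)
    return mean(errors), max(errors)


# Hypothetical volumes (in ml) of one tumor as delineated by ten experts.
# These numbers are made up for illustration, not data from the study.
example_volumes = [10.2, 12.5, 9.8, 14.1, 11.0, 13.3, 8.9, 12.0, 10.7, 15.4]

for k in (1, 3, 10):
    expected, worst = summarize(example_volumes, k)
    print(f"k={k}: expected error {expected:.1%}, maximum error {worst:.1%}")
```

With only ten segmentations, all subsets of a given size can be enumerated exhaustively (120 subsets for three segmentations), so no random sampling is needed to obtain the expected and maximum error. For a study with several tumors, the same computation would be repeated per tumor and the results aggregated.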