The development of segmentation algorithms for different anatomical structures and imaging protocols is an important task in medical image processing. The validation of these methods, however, is often treated as a subordinate task. Since manual delineations, which are widely used as a surrogate for the ground truth, exhibit an inherent uncertainty, it is preferable to use multiple reference segmentations for an objective validation. This requires a consistent framework that should fulfill three criteria: 1) it should treat all reference masks equally a priori and not demand consensus between the experts; 2) it should evaluate the algorithmic performance in relation to the inter-reference variability, i.e., be more tolerant where the experts disagree about the true segmentation; 3) it should produce results that are comparable for different test data.


We show why current state-of-the-art frameworks, such as the one used at several MICCAI segmentation challenges, do not fulfill these criteria and propose a new validation methodology. A score is computed adaptively for each individual segmentation problem, using a combination of volume- and surface-based comparison metrics. These metrics are transformed into the score by relating them to the variability between the reference masks, which can be measured by comparing the masks with each other or with an estimated ground truth. We present examples from a study on liver tumor segmentation in CT scans, where our score provides a more adequate assessment of the segmentation results than the MICCAI framework.