Although the scientific impact of biomedical image analysis challenges is steadily increasing, there is a surprisingly large discrepancy between the challenges’ impact and their quality control. In particular, challenge rankings are sensitive to a range of challenge design parameters. For example, rankings, and thus the identified challenge winner, may depend strongly on the chosen ranking method or on a small number of test cases. Such instabilities in rankings call the validity and transferability of challenge results into question. Yet most challenge publications ignore the uncertainty associated with rankings, and result presentations are often limited to the ranking list and simple visualizations of the metric values for each algorithm.

Thus, the purpose of this work is to propose a methodology, along with an open-source framework, for systematically analyzing and visualizing challenge results. It is intended to help challenge organizers and participants gain further insights, in an intuitive manner, into both the algorithms' performance and the assessment data set itself.

Visualization approaches are presented both for challenges designed around a single task and for challenges comprising multiple tasks. The proposed tools involve bootstrapping, significance testing and unsupervised learning. They make it possible to investigate questions such as whether there are influential test cases, whether the winner is consistently superior to other algorithms across test cases, whether the winner is significantly superior, what range of ranks for a specific algorithm is supported by the data, which task yields a clear separation of algorithms and a stable ranking, and which tasks are similar with respect to their rankings.
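To make the bootstrapping idea concrete, the following is a minimal Python sketch, not the framework presented in the talk. It assumes a score matrix with one row per test case and one column per algorithm, where higher values are better, and it assumes a simple "average the metric, then rank" scheme, which is only one of several possible ranking methods. The function names (rank_algorithms, bootstrap_ranks, rank_intervals) and the toy scores are hypothetical and chosen for illustration only.

```python
import numpy as np


def rank_algorithms(scores):
    """Rank algorithms by their mean score over test cases.

    scores: array of shape (n_cases, n_algorithms), higher is better.
    Returns integer ranks, where rank 1 is the best algorithm.
    """
    means = scores.mean(axis=0)
    order = np.argsort(-means, kind="stable")  # best algorithm first
    ranks = np.empty(len(means), dtype=int)
    ranks[order] = np.arange(1, len(means) + 1)
    return ranks


def bootstrap_ranks(scores, n_boot=1000, seed=0):
    """Resample test cases with replacement and re-rank each time."""
    rng = np.random.default_rng(seed)
    n_cases, n_algorithms = scores.shape
    boot = np.empty((n_boot, n_algorithms), dtype=int)
    for b in range(n_boot):
        idx = rng.integers(0, n_cases, size=n_cases)  # bootstrap sample of cases
        boot[b] = rank_algorithms(scores[idx])
    return boot


def rank_intervals(boot, level=0.95):
    """Percentile interval of the bootstrapped ranks for each algorithm."""
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(boot, alpha, axis=0)
    upper = np.quantile(boot, 1.0 - alpha, axis=0)
    return lower, upper


# Toy data (made up): 4 test cases, 3 algorithms.
toy_scores = np.array([
    [0.90, 0.50, 0.40],
    [0.80, 0.60, 0.70],
    [0.95, 0.30, 0.50],
    [0.85, 0.55, 0.50],
])
toy_boot = bootstrap_ranks(toy_scores, n_boot=200, seed=1)
toy_lower, toy_upper = rank_intervals(toy_boot)
print("observed ranks:", rank_algorithms(toy_scores))
print("rank intervals:", list(zip(toy_lower, toy_upper)))
```

A wide interval for an algorithm suggests that its position in the ranking depends on which test cases happen to be included, which is the kind of instability the questions above are meant to expose.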

All techniques are illustrated with synthetic and real-world assessment data.

About Manuel Wiesenfarth

Manuel Wiesenfarth is a biostatistician at the German Cancer Research Center Heidelberg (DKFZ) and at Cogitars, a consulting company for innovative clinical trials. He provides statistical support in biomedical research and conducts research in statistical methodology, with a particular focus on Bayesian methods for clinical trials in personalized oncology. Before joining the German Cancer Research Center, he received his PhD in statistics from the University of Göttingen and held a postdoctoral position at the University of Mannheim. The overarching topic of his research is the quantification of uncertainty, which he believes should be an integral part of any scientific study, including the validation of machine learning algorithms in medical image analysis, where performance often tends to be heterogeneous across test cases and tasks. He is the author of a number of publications in leading journals in statistics, radiology and urology, and of several R packages.