Embodied robotic applications have strongly influenced artificial intelligence to focus on broad, robust solutions that operate in the real world, but evaluating such systems remains difficult. Competition-based evaluation, using common challenge problems, is one of the major methods for comparing AI systems that employ robotic embodiment. Unfortunately, competitions tend to encourage narrow solutions that exploit particular rules rather than the broad and robust techniques that are hoped for, and physical embodiment in the real world makes control and repeatability difficult. In this paper we discuss the positive and negative influences of competitions as a means of evaluating AI systems, and present recent work designed to improve such evaluations. We describe how mixed reality applications for challenge problems can improve control and repeatability, and how competitions themselves can encourage breadth and robustness, using our rules for the FIRA HuroCup as an example.