Lessons learned from organizing UNICORN

UNICORN was a public challenge held at MICCAI 2025 to evaluate multimodal foundation models (FMs) across 20 tasks spanning computational pathology, radiology, and medical language processing. The idea was simple: take a single model, adapt it to a diverse set of clinically meaningful tasks with minimal supervision, and see how far it generalizes.

Running it was anything but simple. UNICORN was a team effort, and what follows are my personal reflections on the experience.


Why we built it

Foundation models promise broad adaptability — across diseases, modalities, and tasks — with minimal supervision. But rapid FM development has outpaced our ability to measure and compare capabilities. What we have instead is a patchwork of fragmented, ad hoc benchmarks, each with its own protocols, datasets, and evaluation criteria. This makes fair, transparent, and reproducible comparison nearly impossible.

The problem is well-documented: inconsistent protocols, gated datasets, tasks of low clinical relevance. And the shift from task-specific to open-ended models only makes it worse — single-task challenges can no longer tell us whether a model is genuinely capable or just well-optimized for one narrow problem.

We believe clinically grounded data challenges are the right format. But existing challenges are too narrowly scoped — one organ, one task, one modality — to say anything meaningful about FM adaptability.


Design principles

We distilled the challenge requirements into 10 core design principles:

1. Data: high-quality, diverse datasets spanning institutions, populations, and acquisition protocols.
2. Scope: multi-task, multi-modal, covering clinically meaningful problems.
3. Adaptation-aware: explicitly test how efficiently models adapt under limited supervision.
4. Reproducibility: fixed, transparent evaluation protocols ensuring fair comparisons.
5. Trustworthiness: sequestered test sets, inaccessible to participants.
6. Scalability: fully automated execution across tasks and modalities.
7. Community-driven: open submissions, open evaluation protocols.
8. Compute: access to high-grade GPUs for running modern FMs.
9. Cost: balance accessibility with efficiency.
10. Privacy: protect proprietary model weights and sensitive data.

In hindsight, robustness and generalization should have been an 11th principle. Evaluation should probe sensitivity to distribution shift and real-world variability, both in how data is collected (diverse institutions, multiple populations) and in the metrics reported. We did not design for this explicitly, and I think it shows in how the results should be interpreted.


From principles to practice

These principles look ambitious on paper, and they are. Implementing all 10 simultaneously requires both technical infrastructure and organizational buy-in that few groups can assemble from scratch.

We had a head start: we built on Grand Challenge, a mature platform hosted on AWS that already handled secure dataset hosting, standardized evaluation pipelines, automated execution, and community participation. Most of the hard infrastructure was already there. Our contribution was to extend it, collaborating closely with the Grand Challenge team to enable the properties a medical FM benchmark was still missing.

The core workflow is container-based. Participants submit a Docker container that runs model inference on unseen inputs (Step 1). A separate organizer-provided container then evaluates predictions against sequestered ground truth labels and produces leaderboard metrics (Step 2).
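As a rough sketch of that two-step flow, the two containers could exchange predictions through a shared output directory. All file names, the JSON layout, and the accuracy metric below are hypothetical, not the actual Grand Challenge interface:

```python
import json
from pathlib import Path


def run_inference(input_dir: Path, output_dir: Path) -> None:
    """Step 1 (participant container): run the model on unseen inputs
    and write predictions; no ground-truth labels are visible here."""
    predictions = {}
    for case_file in sorted(input_dir.glob("*.json")):
        case = json.loads(case_file.read_text())
        predictions[case["case_id"]] = 0.5  # stand-in for a real model score
    (output_dir / "predictions.json").write_text(json.dumps(predictions))


def run_evaluation(output_dir: Path, ground_truth: dict) -> dict:
    """Step 2 (organizer container): score the written predictions
    against sequestered labels and emit leaderboard metrics."""
    predictions = json.loads((output_dir / "predictions.json").read_text())
    correct = sum(
        int((predictions.get(case_id, 0.0) >= 0.5) == bool(label))
        for case_id, label in ground_truth.items()
    )
    return {"accuracy": correct / len(ground_truth)}
```

The key property this separation buys is trustworthiness: the participant container never sees the labels, and the organizer container never needs to trust participant code.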

A key architectural decision: freeze the FM

Our original idea was to prompt the FM to solve each task from a few labeled examples, much as one prompts an LLM. In practice, this does not transfer to vision: most vision FMs are encoders, not generative models, so prompt-based few-shot adaptation does not directly apply. Full fine-tuning on a handful of labeled examples is also unrealistic for clinical deployment.

This led to a more principled design: vision FMs are frozen during Step 1, acting purely as feature encoders with no access to labels. In Step 2, lightweight task-specific adaptation heads are fitted using labeled few-shots to map FM features to case-level predictions.
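A minimal pure-Python sketch of this protocol, where `frozen_encoder` stands in for a real vision FM and a nearest-centroid classifier plays the role of the lightweight adaptation head (both are illustrative assumptions, not the challenge's actual adapters):

```python
import math


def frozen_encoder(image: list[float]) -> list[float]:
    """Stand-in for a frozen vision FM: maps an input to a fixed
    feature vector. Its weights are never updated during Step 1."""
    return [sum(image) / len(image), max(image) - min(image)]


def fit_head(few_shot_images: list, few_shot_labels: list):
    """Step 2: fit a lightweight head (here, nearest class centroid)
    on frozen features of a handful of labeled examples."""
    feats = [frozen_encoder(x) for x in few_shot_images]
    centroids = {}
    for label in set(few_shot_labels):
        members = [f for f, y in zip(feats, few_shot_labels) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def predict(image):
        f = frozen_encoder(image)
        return min(centroids, key=lambda c: math.dist(f, centroids[c]))

    return predict
```

Because only the head is fitted, differences in downstream performance are attributable to the frozen features themselves rather than to task-specific tuning.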

The goal was not to maximize per-task performance — that would just reward engineering effort — but to enable a fair comparison of the intrinsic representation quality of each FM across diverse clinical problems. The task-specific adaptation heads are also open-sourced via a shared evaluation repository, enabling open inspection and reuse across teams.


What we learned

1. The field is not ready for a single multi-modal FM. Most participants still use separate models for different modalities. The closest anyone got was a single vision model covering both pathology and radiology — but not a unified model spanning vision and language.

2. Few-shot dense prediction is an open problem. For pixel-level or region-level tasks, adapting from just a few labeled examples remains genuinely difficult. This wasn’t a bottleneck we solved — it’s an open research problem the challenge surfaced clearly.

3. The winning approach combined SSL and multi-task supervised learning. The top-ranking submission did not rely solely on a self-supervised pretrained model. It combined self-supervised learning with multi-task supervised training — suggesting that pure SSL pretraining alone is not yet sufficient for broad clinical generalization.

4. Balancing flexibility and scalability is harder than it sounds. Allowing participants to submit arbitrary containers gives flexibility, but it creates friction: file format inconsistencies, security validation overhead, imposed resource limits, and limited control over algorithmic choices all become real problems at scale.

5. Running such a challenge is costly. Compute costs for executing and evaluating dozens of FM submissions across 20 tasks add up quickly. This is a real barrier to sustainability.

6. Preparing data and labels is tedious but essential. Clean, clinically annotated datasets are the foundation of everything. This work is rarely glamorous and consistently underestimated.


A broader perspective

UNICORN is a first step, setting the stage for the community to iterate and refine. But I’ll be direct: incremental benchmarks alone are unlikely to be sufficient to meaningfully impact either scientific progress or patient care.

To achieve that level of impact, the field needs to move toward truly global, consortium-driven evaluation efforts — bringing all stakeholders into the loop: hospitals and clinicians who curate data and interpret results; academic researchers and industry leaders who develop models; statisticians who design robust protocols; cloud providers who supply infrastructure; patients whose data underpins everything; and critically, regulatory and certifying bodies whose early participation is essential if at least a subset of benchmark evaluations is to be explicitly designed to meet regulatory requirements.

The long-term goal should be a shared platform or data registry — a de facto reference benchmark against which progress in medical AI can be measured and, eventually, translated into clinical practice.

I also find myself increasingly skeptical of headline performance numbers reported in isolation. Seemingly minor differences in data curation, preprocessing, cohort selection, or evaluation protocol can lead to large and opaque discrepancies in reported results. What the field needs are common, trusted benchmarks that function as stable scientific anchors — analogous to ImageNet in computer vision — so that progress reflects genuine advances rather than fragile experimental setups.

UNICORN is not that anchor yet. Its implementation on Grand Challenge was pragmatic but constraining (limited control over submissions, limited scalability). The field is also already moving beyond static models toward agentic systems, which will likely require different evaluation workflows entirely. Despite these limitations, the core design principles explored in UNICORN are broadly applicable to medical AI as a whole — not just to foundation models.

UNICORN is best viewed as an early structural experiment: a signal of where the field needs to go if it is to move from fragmented performance claims toward reliable, clinically meaningful progress.