Recent work on robot manipulation has advanced policy generalization to novel scenarios. However, it is often difficult to characterize how different evaluation settings actually represent generalization from the training distribution of a given policy. Towards more precise evaluation of generalization in robotics, we propose RADAR, a scalable framework for directly comparing test-time evaluation tasks to policy training data, to determine what form of policy generalization is required. RADAR consists of a two-stage pipeline: first, retrieval using generalist policy embeddings identifies which training examples are relevant for a given evaluation task. Next, vision-language models (VLMs) analyze the evaluation task against the retrieved data, outputting an interpretable analysis of how they compare along a variety of axes, and an overall classification of what type of policy generalization is required. Through controlled experiments, we demonstrate that VLMs are effective at analyzing data for generalization, and that our retrieval step reliably identifies the examples needed to make accurate classifications with respect to the training data. Furthermore, we scale RADAR to large-scale datasets, where we observe agreement with human-defined benchmark conditions from prior work.
RADAR determines what form of generalization a robot policy requires at evaluation time by comparing evaluation scenarios directly to training data — using vision-language-action (VLA) embeddings for retrieval and a VLM for interpretable analysis.
To scale to large datasets, RADAR first uses embedding-based retrieval to identify the $k$ nearest neighbors of the evaluation scenario $\tau^\text{test}$ in the training data $\mathcal{D}$, forming a retrieved set $\mathcal{D}_\text{retrieval}$. We propose using internal representations from generalist vision-language-action (VLA) policies as embeddings.
VLA embeddings inherit two key properties: visual and semantic invariance from VLM pretraining, and behavioral sensitivity from training on large-scale robot data. Both properties are needed in our framework to identify the examples required for accurate downstream analysis (see paper for details).
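The retrieval step above can be sketched as a nearest-neighbor lookup over precomputed embeddings. This is a minimal illustration, assuming cosine similarity as the distance metric and already-extracted embedding vectors; the actual metric and extraction layer are implementation choices not specified here.

```python
import numpy as np

def retrieve_nearest(test_emb: np.ndarray, train_embs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k training examples most similar to the test embedding.

    test_emb:   (d,) embedding of the evaluation scenario.
    train_embs: (n, d) embeddings of the training examples.
    """
    # Normalize so the dot product equals cosine similarity.
    test_emb = test_emb / np.linalg.norm(test_emb)
    train_embs = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = train_embs @ test_emb            # similarity to each training example
    return np.argsort(-sims)[:k]            # indices of the k most similar examples
```

An exact top-k scan like this is fine at moderate scale; for datasets with millions of demonstrations, an approximate nearest-neighbor index would typically replace the full sort.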
Given $\mathcal{D}_\text{retrieval}$, a VLM analyzes each retrieved example against $\tau^\text{test}$ along seven generalization axes drawn from the $\bigstar$-Gen taxonomy: Image Augmentations, Visual Task Object, Visual Scene, Object Poses, Morphed Objects, New Object, and Interacting Scene.
For each retrieved example, the VLM predicts whether each axis is in-distribution and classifies the example as same, a visual perturbation, or a behavioral perturbation. These predictions are aggregated to give a final classification: in-distribution, visual generalization, or behavioral generalization.
We evaluate RADAR on three task families on the ALOHA 2 platform, with over 2,300 annotated task variations that are unseen by the VLAs whose embeddings we use.
We evaluate how recall of relevant training examples scales with retrieval set size. We compare VLA-based embeddings ($\pi_0$, $\pi_{0.5}$, Gemini Robotics On-Device (GROD)) against baselines that do not use robot data (PaliGemma 2, SigLIP 2, DINOv3).
VLA embeddings — especially GROD — exhibit larger distances for behavioral perturbations (blue) than visual perturbations (orange), a key property enabling effective retrieval. Baselines that do not use robot data lack this structure.
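The recall metric used in this comparison can be computed as below: for each evaluation task, the fraction of its relevant training examples recovered in the top-$k$ retrieval, averaged over tasks. The exact averaging convention is an assumption for illustration.

```python
def recall_at_k(retrieved: list[list[int]], relevant: list[set[int]], k: int) -> float:
    """Mean fraction of relevant training examples found in the top-k retrieval.

    retrieved: per-task ranked lists of retrieved training-example indices.
    relevant:  per-task sets of ground-truth relevant training-example indices.
    """
    scores = []
    for ret, rel in zip(retrieved, relevant):
        if not rel:
            continue  # skip tasks with no annotated relevant examples
        hits = len(set(ret[:k]) & rel)
        scores.append(hits / len(rel))
    return sum(scores) / len(scores)
```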
We benchmark different versions of Gemini on their ability to classify generalization when comparing evaluation tasks against retrieved data.
| Task | VLM | In-Dist Acc (%) | Visual Acc (%) | Behavioral Acc (%) | Overall Acc (%) |
|---|---|---|---|---|---|
| Pick-and-Place | Gemini 3.0 Flash | 91.3 | 73.9 | 92.4 | 85.9 |
| Pick-and-Place | Gemini 3.0 Pro | 99.0 | 80.9 | 85.6 | 88.5 |
| Pick-and-Place | Gemini 3.1 Pro | 94.9 | 82.7 | 100.0 | 92.5 |
| Unzip Lunchbag | Gemini 3.0 Flash | 96.0 | 58.7 | 52.0 | 68.9 |
| Unzip Lunchbag | Gemini 3.0 Pro | 100.0 | 29.8 | 4.2 | 44.7 |
| Unzip Lunchbag | Gemini 3.1 Pro | 93.8 | 71.4 | 54.0 | 72.8 |
| Fold Dress | Gemini 3.0 Flash | 91.7 | 39.1 | 51.0 | 60.6 |
| Fold Dress | Gemini 3.0 Pro | 95.9 | 12.8 | 12.5 | 40.4 |
| Fold Dress | Gemini 3.1 Pro | 98.0 | 33.3 | 33.3 | 55.2 |
We apply RADAR to two large-scale real-world datasets: Bridge V2 and a proprietary dataset of over 1M ALOHA 2 demonstrations, assessing evaluation tasks from prior benchmark work.
Use RADAR to analyze evaluation tasks against Bridge V2, using CogACT embeddings for retrieval. Choose a provided task example to see retrieval results and Gemini-produced generalization analysis. You can also try your own task examples by uploading an image and providing an instruction. This requires the embedding model to be loaded by the server (which may take some time), and a Gemini API key to run the Gemini analysis.
@article{gao2026radar,
title = {Grounding Robot Generalization in Training Data
via Retrieval-Augmented VLMs},
author = {Gao, Jensen and Sadigh, Dorsa and Huang, Sandy and Shah, Dhruv},
journal = {arXiv preprint arXiv:2603.11426},
year = {2026}
}