Grounding Robot Generalization in Training Data via Retrieval-Augmented VLMs

1Stanford University    2Google DeepMind    3Princeton University

Recent work on robot manipulation has advanced policy generalization to novel scenarios. However, it is often difficult to characterize how a given evaluation setting actually differs from the training distribution of a policy, and thus what form of generalization it demands. Towards more precise evaluation of generalization in robotics, we propose RADAR, a scalable framework that directly compares test-time evaluation tasks to policy training data to determine what form of policy generalization is required. RADAR is a two-stage pipeline: first, retrieval using generalist policy embeddings identifies which training examples are relevant to a given evaluation task. Then, vision-language models (VLMs) analyze the evaluation task against the retrieved data, producing interpretable comparisons along a variety of axes and an overall classification of the type of policy generalization required. Through controlled experiments, we demonstrate that VLMs are effective at analyzing data for generalization, and that our retrieval step effectively identifies the examples needed to make accurate classifications with respect to the training data. Furthermore, we scale RADAR to large-scale datasets, where its outputs agree with human-defined benchmark conditions from prior work.

The RADAR Framework

RADAR determines what form of generalization a robot policy requires at evaluation time by comparing evaluation scenarios directly to training data — using VLA embeddings for retrieval and a VLM for interpretable analysis.

RADAR method overview
RADAR overview. Given an evaluation scenario $\tau^\text{test}$ and a training dataset $\mathcal{D}$, RADAR first retrieves a subset $\mathcal{D}_\text{retrieval}$ using embeddings from a generalist VLA policy. A VLM then analyzes the retrieved examples against the evaluation scenario, outputting per-axis descriptions and an overall classification: in-distribution, visual generalization, or behavioral generalization.

Stage 1 — Retrieval via VLA Embeddings

To scale to large datasets, RADAR first uses embedding-based retrieval to identify the $k$ nearest neighbors of the evaluation scenario $\tau^\text{test}$ in the training data $\mathcal{D}$. We propose using internal representations from generalist vision-language-action (VLA) policies as embeddings.

VLA embeddings inherit two key properties: visual and semantic invariance from VLM pretraining, and behavioral sensitivity from large-scale robot data training. These properties are important in our framework to identify examples needed for accurate downstream analysis (see paper for details).
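The retrieval step can be sketched as plain nearest-neighbor search over VLA embedding vectors. This is a minimal illustration, not the paper's implementation: the `embed` placeholder stands in for a forward pass through a generalist VLA policy that maps one (image, instruction) example to an internal-representation vector, and cosine similarity is one reasonable choice of distance.

```python
import numpy as np

def embed(example) -> np.ndarray:
    """Placeholder (hypothetical): a forward pass through a generalist VLA
    policy that returns an internal-representation vector for one example."""
    raise NotImplementedError

def retrieve(test_embedding: np.ndarray,
             train_embeddings: np.ndarray,
             k: int) -> np.ndarray:
    """Return indices of the k nearest training examples by cosine similarity."""
    # Normalize so that a dot product equals cosine similarity.
    q = test_embedding / np.linalg.norm(test_embedding)
    db = train_embeddings / np.linalg.norm(train_embeddings, axis=1, keepdims=True)
    sims = db @ q
    # Negate so argsort (ascending) yields the k most similar examples first.
    return np.argsort(-sims)[:k]
```

For datasets with millions of examples, the same query would typically run against an approximate nearest-neighbor index rather than a dense matrix product.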

Stage 2 — VLM Analysis of Generalization

Given $\mathcal{D}_\text{retrieval}$, a VLM analyzes each retrieved example against $\tau^\text{test}$ along seven generalization axes drawn from the $\bigstar$-Gen taxonomy: Image Augmentations, Visual Task Object, Visual Scene, Object Poses, Morphed Objects, New Object, and Interacting Scene.

For each retrieved example, the VLM predicts whether each axis is in-distribution and classifies the example as same, a visual perturbation, or a behavioral perturbation. These predictions are aggregated to give a final classification: in-distribution, visual generalization, or behavioral generalization.
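One plausible aggregation rule can be sketched as follows. The split of the seven axes into visual and behavioral groups, and the precedence order (any matching example implies in-distribution, otherwise the mildest form of generalization any example supports), are illustrative assumptions rather than the exact procedure from the paper.

```python
# Assumed axis grouping for illustration; see the paper for the precise split.
BEHAVIORAL_AXES = {"Object Poses", "Morphed Objects", "New Object", "Interacting Scene"}
VISUAL_AXES = {"Image Augmentations", "Visual Task Object", "Visual Scene"}

def classify_example(axis_in_dist: dict[str, bool]) -> str:
    """Map one retrieved example's per-axis in-distribution flags to
    same / visual / behavioral (missing axes default to in-distribution)."""
    if not all(axis_in_dist.get(a, True) for a in BEHAVIORAL_AXES):
        return "behavioral"
    if not all(axis_in_dist.get(a, True) for a in VISUAL_AXES):
        return "visual"
    return "same"

def aggregate(example_labels: list[str]) -> str:
    """Final classification over all retrieved examples: in-distribution if
    any example matches, else the mildest generalization any example needs."""
    if "same" in example_labels:
        return "in-distribution"
    if "visual" in example_labels:
        return "visual generalization"
    return "behavioral generalization"
```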

Controlled Experiments

We evaluate RADAR on three task families on the ALOHA 2 platform, with over 2,300 annotated task variations that are unseen by the VLAs whose embeddings we use.

Task Families

Task family examples
Three task families used in controlled experiments: Pick-and-Place, Unzip Lunchbag, and Fold Dress. For each task family, we provide an example base task instance (left) and variations across different generalization axes (right).

Retrieval Performance

We evaluate how recall of relevant training examples scales with retrieval set size. We compare VLA-based embeddings ($\pi_0$, $\pi_{0.5}$, Gemini Robotics On-Device/GROD) against baselines that do not use robot data (PaliGemma 2, SigLIP 2, DINOv3).

In-distribution retrieval recall
In-distribution recall. All VLA-based embeddings and DINOv3 achieve >95% recall while retrieving less than 5% of the dataset.
Visual generalization retrieval recall
Visual generalization recall. GROD generally performs the best, with $\pi_{0.5}$ as the only other competitive model. VLA embeddings significantly outperform non-behavior-aware baselines (PG2, SigLIP 2, DINOv3).
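The recall curves above plot a standard retrieval metric: the fraction of ground-truth relevant training examples that appear among the retrieved set, as the retrieval budget grows. A minimal sketch:

```python
def recall_at_k(retrieved_ids: list[int], relevant_ids: set[int]) -> float:
    """Fraction of ground-truth relevant examples found in the retrieved set.
    Defined as 1.0 when there are no relevant examples."""
    if not relevant_ids:
        return 1.0
    hits = len(set(retrieved_ids) & relevant_ids)
    return hits / len(relevant_ids)
```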

Embedding Distance Structure

VLA embeddings — especially GROD — exhibit larger distances for behavioral perturbations (blue) than visual perturbations (orange), a key property enabling effective retrieval. Baselines that do not use robot data lack this structure.

Embedding distances part 1 Embedding distances part 2
Distribution of embedding distances induced by task perturbations across different axes for Pick-and-Place. VLA embeddings (especially GROD) separate behavioral axes (blue) from visual axes (orange) more distinctly than all other methods.
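The distance distributions above can be reproduced in outline by comparing each base-task embedding against its perturbed counterpart and grouping by perturbation axis. This is a sketch under the assumption that cosine distance is the metric; the paper's exact choice may differ.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus cosine similarity between two embedding vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_distance_by_axis(pairs):
    """pairs: iterable of (axis_name, base_embedding, perturbed_embedding).
    Returns the mean cosine distance per perturbation axis."""
    sums, counts = {}, {}
    for axis, base, pert in pairs:
        d = cosine_distance(base, pert)
        sums[axis] = sums.get(axis, 0.0) + d
        counts[axis] = counts.get(axis, 0) + 1
    return {axis: sums[axis] / counts[axis] for axis in sums}
```

Behavior-aware embeddings should yield larger mean distances for behavioral axes (e.g. New Object) than for visual axes (e.g. Visual Scene).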

VLM Analysis Results

We benchmark different versions of Gemini on their ability to classify generalization when comparing evaluation tasks against retrieved data.

Task             VLM                 In-Dist Acc   Visual Acc   Behavioral Acc   Overall Acc
Pick-and-Place   Gemini 3.0 Flash    91.3          73.9         92.4             85.9
                 Gemini 3.0 Pro      99.0          80.9         85.6             88.5
                 Gemini 3.1 Pro      94.9          82.7         100.0            92.5
Unzip Lunchbag   Gemini 3.0 Flash    96.0          58.7         52.0             68.9
                 Gemini 3.0 Pro      100.0         29.8         4.2              44.7
                 Gemini 3.1 Pro      93.8          71.4         54.0             72.8
Fold Dress       Gemini 3.0 Flash    91.7          39.1         51.0             60.6
                 Gemini 3.0 Pro      95.9          12.8         12.5             40.4
                 Gemini 3.1 Pro      98.0          33.3         33.3             55.2
RADAR achieves up to 92.5% overall classification accuracy on Pick-and-Place. Performance is lower on tasks with more subtle visual differences (Unzip Lunchbag, Fold Dress), highlighting a current limitation of VLMs that diminishes as models become stronger.
VLM success and failure examples
VLM successes (green) and failures (red) from Gemini 3.0 Flash for "Object Poses" and "Morphed Objects" axes. Failure modes typically involve subtle visual changes — such as a slight lunchbag rotation or minor size differences — that are challenging even for state-of-the-art VLMs.

Large-Scale Analysis

We apply RADAR to two large-scale real-world datasets: Bridge V2 and a proprietary dataset of over 1M ALOHA 2 demonstrations, assessing evaluation tasks from prior benchmark work.

Large-scale RADAR examples
RADAR applied to large-scale ALOHA 2 benchmarking. For each evaluation task (left), RADAR retrieves relevant training examples (middle) and a VLM provides interpretable analysis with an overall classification (right). We observe agreement with prior human-defined characterizations of generalization.

Try RADAR

Use RADAR to analyze evaluation tasks against Bridge V2, using CogACT embeddings for retrieval. Choose a provided task example to see retrieval results and Gemini-generated generalization analysis, or try your own task by uploading an image and providing an instruction. Custom tasks require the server to load the embedding model (which may take some time) and a Gemini API key to run the analysis.


BibTeX

@article{gao2026radar,
  title   = {Grounding Robot Generalization in Training Data
             via Retrieval-Augmented VLMs},
  author  = {Gao, Jensen and Sadigh, Dorsa and Huang, Sandy and Shah, Dhruv},
  journal = {arXiv preprint arXiv:2603.11426},
  year    = {2026}
}