VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision–Language Models

ICLR 2026

Department of Computing, Imperial College London
VITA inference framework diagram

VITA is a test-time adaptation method that improves both generalization and temporal reasoning of VLMs for zero-shot goal-conditioned value function estimation.

Abstract

Vision–Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot method using autoregressive VLMs. In addition, we demonstrate that VITA's zero-shot value estimates can be utilized for reward shaping in offline reinforcement learning, resulting in multi-task policies on the Meta-World benchmark that exceed the performance of those trained with the simulation's fuzzy-logic dense rewards.

Test-Time Adaptation

VITA architecture diagram

Test-time adaptation. At each timestep t, a TTT adaptation module is updated via a gradient step on a meta-learned self-supervised loss, encoding temporal history into its parameters.

Our goal-conditioned value function estimator comprises three modules: (1) a multimodal encoder that extracts joint visual-language representations from visual trajectories and their task descriptions; (2) an adaptation module updated at test-time using a self-supervised loss that is meta-learned to improve value estimation; and (3) a regression head that predicts value estimates.

To adapt multimodal representations to both semantic and temporal context, we employ an adaptation module f_adapt following the test-time training (TTT) paradigm (Sun et al., 2020). At inference, the TTT adaptation module is updated at each timestep via a gradient step on a meta-learned self-supervised loss, encoding temporal history into its parameters while improving value function estimation.
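To make the online update concrete, here is a minimal NumPy sketch of sequential test-time adaptation. It is illustrative only: the adaptation module is reduced to a single linear map W, and the meta-learned self-supervised loss is replaced by a hand-crafted reconstruction objective, L = ||W z − z||², since the actual meta-learned loss is not specified here. The key property it demonstrates is that W is updated once per timestep and carried forward, so the parameters accumulate trajectory history.

```python
import numpy as np

def ttt_step(W, z_t, lr=0.1):
    """One test-time update of a linear adaptation module on the current
    feature z_t. The loss is a plain reconstruction objective,
    L = ||W z - z||^2 -- a hand-crafted stand-in for VITA's meta-learned
    self-supervised loss."""
    residual = W @ z_t - z_t                # reconstruction error
    grad = 2.0 * np.outer(residual, z_t)    # dL/dW
    return W - lr * grad                    # one gradient step

def adapt_over_trajectory(features, lr=0.1):
    """Apply the update sequentially over a trajectory of per-timestep
    features, so W accumulates temporal history across timesteps
    (the online TTT scheme)."""
    d = features.shape[1]
    W = np.zeros((d, d))                    # adaptation-module parameters
    adapted = []
    for z_t in features:
        W = ttt_step(W, z_t, lr)            # adapt on this timestep
        adapted.append(W @ z_t)             # adapted representation
    return np.stack(adapted), W
```

In VITA the downstream regression head consumes the adapted representation; here the loop simply returns it, together with the history-carrying parameters W.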


Meta-Learning

VITA training framework

Meta-learning. VITA learns a goal-conditioned value function via meta-learning and achieves zero-shot generalization to out-of-distribution trajectories via test-time adaptation.

VITA performs a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. During training, the adaptation module f_adapt is updated online at each timestep using the self-supervised loss \(\ell_{\text{self}}\). Following the gradient-based meta-learning paradigm (Finn et al., 2017), VITA's self-supervised task is meta-learned to improve value estimation rather than being predefined a priori. In addition, we propose a dissimilarity-based sampling strategy that increases intra-batch variance and acts as a form of importance sampling, emphasizing underrepresented but semantically meaningful segments of the trajectory.
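One plausible instantiation of dissimilarity-based sampling is greedy farthest-point selection over segment embeddings: each pick is the segment whose closest already-chosen neighbour is least similar, which raises intra-batch variance and favours underrepresented parts of the trajectory. The sketch below assumes cosine dissimilarity and a random seed segment; the paper's exact criterion may differ.

```python
import numpy as np

def dissimilarity_sample(embeddings, k, seed=0):
    """Greedy farthest-point selection of k trajectory segments by
    cosine dissimilarity -- an illustrative stand-in for the paper's
    dissimilarity-based sampling strategy."""
    # L2-normalize so dot products are cosine similarities
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(z)))]       # random first segment
    while len(chosen) < k:
        sims = z @ z[chosen].T                 # similarity to chosen set
        nearest = sims.max(axis=1)             # closest chosen neighbour
        nearest[chosen] = np.inf               # never re-pick a segment
        chosen.append(int(nearest.argmin()))   # most dissimilar segment
    return sorted(chosen)
```

On a trajectory whose segments cluster into a few repeated motions, this selection spreads the training batch across clusters instead of oversampling the dominant one.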

Evaluating Generalization Under Distribution Shifts

(a) "Put red object into silver pot."

(b) "Fold clothes from left to right."

(c) "Sweep into pile."

(d) "Put orange object in drawer."

Examples of visual trajectories paired with task descriptions under different distribution shifts. (a) In-distribution. (b, c) Environment shift. (d) Embodiment and environment shift.

Dataset descriptions with task type, environment, and embodiment.

Dataset       Task Type        Environment             Embodiment
tk_pnp        pick-and-place   toy kitchen             WidowX 250
lm_pnp        pick-and-place   laundry machine         WidowX 250
td_fold       fold cloth       tabletop (dark wood)    WidowX 250
ft_fold       fold cloth       folding table           WidowX 250
rd_fold       fold cloth       robot desk              WidowX 250
ms_sweep      sweep            folding table (tray)    WidowX 250
dt_tk_pnp     pick-and-place   toy kitchen             DeepThought
dt_tk_stack   stack blocks     toy kitchen             DeepThought
dt_ft_stack   stack blocks     folding table           DeepThought
dt_rd_pnp     pick-and-place   robot desk (drawer)     DeepThought

VOC scores for value function estimation under distribution shifts. ID = In-Distribution, ES = Environment Shift, EM = Embodiment Shift, ES & EM = Both Shifts.

Shift     Dataset       VLM-CL   VLM-RM   CLIP-FT   GVL-0S   GVL-1S   CLIP-GRU   VITA
ID        tk_pnp         0.038    0.029    0.251     0.269    0.252    0.773     0.782
ES        lm_pnp         0.017    0.033    0.149     0.305    0.272    0.676     0.725
ES        td_fold        0.031    0.072    0.152     0.326    0.318    0.674     0.709
ES        ft_fold        0.108    0.099    0.162     0.331    0.387    0.693     0.658
ES        rd_fold        0.095    0.055    0.126     0.372    0.406    0.726     0.606
ES        ms_sweep      -0.129   -0.226    0.148     0.158    0.150    0.434     0.490
EM        dt_tk_pnp      0.042   -0.041    0.149     0.258    0.211    0.856     0.820
EM        dt_tk_stack    0.035    0.046    0.099     0.254    0.277    0.667     0.708
ES & EM   dt_ft_stack    0.026    0.028    0.049     0.212    0.249    0.674     0.698
ES & EM   dt_rd_pnp      0.023    0.041    0.211     0.329    0.316    0.747     0.695
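For readers unfamiliar with the metric: VOC (Value-Order Correlation), introduced with GVL, is the rank correlation between a trajectory's predicted values and its chronological order, so +1 means the estimated value rises monotonically toward the goal. A minimal sketch under that assumed definition (ties are not rank-averaged here):

```python
import numpy as np

def voc(values):
    """Value-Order Correlation: Spearman rank correlation between
    predicted per-timestep values and chronological order.
    +1 = values increase monotonically toward the goal."""
    v = np.asarray(values, dtype=float)
    t = np.arange(len(v))                      # chronological order
    ranks = np.argsort(np.argsort(v))          # rank of each value
    return float(np.corrcoef(ranks, t)[0, 1])  # Pearson on ranks = Spearman
```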

Expert vs. Non-Expert

Method     BinVOC
VLM-CL     0.40
VLM-RM     0.00
CLIP-FT    0.80
GVL-0S     1.00
GVL-1S     1.00
CLIP-GRU   0.80
VITA       1.00

Success in distinguishing expert from scripted robot demonstrations, measured by average BinVOC across 5 in-distribution scripted datasets.

Reward Shaping in Offline RL

Method     IQM     95% CI
VLM-CL     0.760   [0.722, 0.791]
VLM-RM     0.746   [0.718, 0.771]
CLIP-FT    0.785   [0.759, 0.809]
CLIP-GRU   0.777   [0.734, 0.814]
META-WL    0.779   [0.750, 0.804]
VITA       0.815   [0.785, 0.838]

Offline RL performance on the Meta-World MT10 benchmark measured by interquartile mean (IQM) with 95% stratified bootstrap CIs.
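The interquartile mean discards the bottom and top quartiles of scores and averages the rest, giving a robust aggregate across runs. A minimal sketch of IQM with a plain percentile bootstrap; note the paper reports *stratified* bootstrap CIs (as in rliable), whereas this simplified version resamples all scores jointly:

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: mean of the middle 50% of scores
    (bottom and top quartiles discarded)."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    n = len(s)
    return float(s[n // 4 : n - n // 4].mean())

def bootstrap_ci(scores, stat=iqm, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for a statistic -- an unstratified
    simplification of the stratified bootstrap used in the paper."""
    rng = np.random.default_rng(seed)
    s = np.asarray(scores, dtype=float).ravel()
    boots = [stat(rng.choice(s, size=len(s), replace=True))
             for _ in range(n_boot)]
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```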

BibTeX

@inproceedings{ziakas2026vita,
  title     = {VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision–Language Models},
  author    = {Ziakas, Christos and Russo, Alessandra},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
}