VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision–Language Models

ICLR 2026

Department of Computing, Imperial College London
VITA inference framework diagram

VITA is a test-time adaptation method that improves both generalization and temporal reasoning of VLMs for zero-shot goal-conditioned value function estimation.

Abstract

Vision–Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot method using autoregressive VLMs. In addition, we demonstrate that VITA's zero-shot value estimates can be utilized for reward shaping in offline reinforcement learning, resulting in multi-task policies on the Meta-World benchmark that exceed the performance of those trained with the simulation's fuzzy-logic dense rewards.

Test-Time Adaptation

VITA architecture diagram

Test-time adaptation. At each timestep t, a TTT adaptation module is updated via a gradient step on a meta-learned self-supervised loss, encoding temporal history into its parameters.

Our goal-conditioned value function estimator comprises three modules: (1) a multimodal encoder that extracts joint visual-language representations from visual trajectories and their task descriptions; (2) an adaptation module updated at test-time using a self-supervised loss that is meta-learned to improve value estimation; and (3) a regression head that predicts value estimates.

To adapt multimodal representations to both semantic and temporal context, we employ an adaptation module f_adapt following the test-time training (TTT) paradigm (Sun et al., 2020). At inference, the TTT adaptation module is updated at each timestep via a gradient step on a meta-learned self-supervised loss, encoding temporal history into its parameters while improving value function estimation.
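To make the online update concrete, here is a minimal NumPy sketch of sequential test-time adaptation. It is illustrative only: the adaptation module is reduced to a single linear map W, and the meta-learned self-supervised loss is replaced by a hand-crafted reconstruction objective, L = ||W z − z||², since the actual meta-learned loss is not specified here. The key property it demonstrates is that W is updated once per timestep and carried forward, so the parameters accumulate trajectory history.

```python
import numpy as np

def ttt_step(W, z_t, lr=0.1):
    """One test-time update of a linear adaptation module on the current
    feature z_t. The loss is a plain reconstruction objective,
    L = ||W z - z||^2 -- a hand-crafted stand-in for VITA's meta-learned
    self-supervised loss."""
    residual = W @ z_t - z_t                # reconstruction error
    grad = 2.0 * np.outer(residual, z_t)    # dL/dW
    return W - lr * grad                    # one gradient step

def adapt_over_trajectory(features, lr=0.1):
    """Apply the update sequentially over a trajectory of per-timestep
    features, so W accumulates temporal history across timesteps
    (the online TTT scheme)."""
    d = features.shape[1]
    W = np.zeros((d, d))                    # adaptation-module parameters
    adapted = []
    for z_t in features:
        W = ttt_step(W, z_t, lr)            # adapt on this timestep
        adapted.append(W @ z_t)             # adapted representation
    return np.stack(adapted), W
```

In VITA the downstream regression head consumes the adapted representation; here the loop simply returns it, together with the history-carrying parameters W.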


Meta-Learning

VITA training framework

Meta-learning. VITA learns a goal-conditioned value function via meta-learning and achieves zero-shot generalization to out-of-distribution trajectories via test-time adaptation.

VITA performs a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. During training, the adaptation module f_adapt is updated online at each timestep using the self-supervised loss \(\ell_{\text{self}}\). Following the gradient-based meta-learning paradigm (Finn et al., 2017), VITA's self-supervised task is meta-learned to improve value estimation rather than being predefined a priori. In addition, we propose a dissimilarity-based sampling strategy that increases intra-batch variance and acts as a form of importance sampling, emphasizing underrepresented but semantically meaningful segments of the trajectory.
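One plausible instantiation of dissimilarity-based sampling is greedy farthest-point selection over segment embeddings: each pick is the segment whose closest already-chosen neighbour is least similar, which raises intra-batch variance and favours underrepresented parts of the trajectory. The sketch below assumes cosine dissimilarity and a random seed segment; the paper's exact criterion may differ.

```python
import numpy as np

def dissimilarity_sample(embeddings, k, seed=0):
    """Greedy farthest-point selection of k trajectory segments by
    cosine dissimilarity -- an illustrative stand-in for the paper's
    dissimilarity-based sampling strategy."""
    # L2-normalize so dot products are cosine similarities
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(z)))]       # random first segment
    while len(chosen) < k:
        sims = z @ z[chosen].T                 # similarity to chosen set
        nearest = sims.max(axis=1)             # closest chosen neighbour
        nearest[chosen] = np.inf               # never re-pick a segment
        chosen.append(int(nearest.argmin()))   # most dissimilar segment
    return sorted(chosen)
```

On a trajectory whose segments cluster into a few repeated motions, this selection spreads the training batch across clusters instead of oversampling the dominant one.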

Evaluating Generalization Under Distribution Shifts

(a) "Put red object into silver pot."

(b) "Fold clothes from left to right."

(c) "Sweep into pile."

(d) "Put orange object in drawer."

Examples of visual trajectories paired with task descriptions under different distribution shifts. (a) In-distribution. (b, c) Environment shift. (d) Embodiment and environment shift.

Dataset descriptions with task type, environment, and embodiment.

Dataset       Task Type        Environment             Embodiment
tk_pnp        pick-and-place   toy kitchen             WidowX 250
lm_pnp        pick-and-place   laundry machine         WidowX 250
td_fold       fold cloth       tabletop (dark wood)    WidowX 250
ft_fold       fold cloth       folding table           WidowX 250
rd_fold       fold cloth       robot desk              WidowX 250
ms_sweep      sweep            folding table (tray)    WidowX 250
dt_tk_pnp     pick-and-place   toy kitchen             DeepThought
dt_tk_stack   stack blocks     toy kitchen             DeepThought
dt_ft_stack   stack blocks     folding table           DeepThought
dt_rd_pnp     pick-and-place   robot desk (drawer)     DeepThought

VOC scores for value function estimation under distribution shifts. ID = In-Distribution, ES = Environment Shift, EM = Embodiment Shift, ES & EM = Both Shifts.

Shift     Dataset       VLM-CL   VLM-RM   CLIP-FT   GVL-0S   GVL-1S   CLIP-GRU   VITA
ID        tk_pnp         0.038    0.029    0.251     0.269    0.252    0.773     0.782
ES        lm_pnp         0.017    0.033    0.149     0.305    0.272    0.676     0.725
ES        td_fold        0.031    0.072    0.152     0.326    0.318    0.674     0.709
ES        ft_fold        0.108    0.099    0.162     0.331    0.387    0.693     0.658
ES        rd_fold        0.095    0.055    0.126     0.372    0.406    0.726     0.606
ES        ms_sweep      -0.129   -0.226    0.148     0.158    0.150    0.434     0.490
EM        dt_tk_pnp      0.042   -0.041    0.149     0.258    0.211    0.856     0.820
EM        dt_tk_stack    0.035    0.046    0.099     0.254    0.277    0.667     0.708
ES & EM   dt_ft_stack    0.026    0.028    0.049     0.212    0.249    0.674     0.698
ES & EM   dt_rd_pnp      0.023    0.041    0.211     0.329    0.316    0.747     0.695
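For readers unfamiliar with the metric: VOC (Value-Order Correlation), introduced with GVL, is the rank correlation between a trajectory's predicted values and its chronological order, so +1 means the estimated value rises monotonically toward the goal. A minimal sketch under that assumed definition (ties are not rank-averaged here):

```python
import numpy as np

def voc(values):
    """Value-Order Correlation: Spearman rank correlation between
    predicted per-timestep values and chronological order.
    +1 = values increase monotonically toward the goal."""
    v = np.asarray(values, dtype=float)
    t = np.arange(len(v))                      # chronological order
    ranks = np.argsort(np.argsort(v))          # rank of each value
    return float(np.corrcoef(ranks, t)[0, 1])  # Pearson on ranks = Spearman
```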

Expert vs. Non-Expert

Method     BinVOC
VLM-CL     0.40
VLM-RM     0.00
CLIP-FT    0.80
GVL-0S     1.00
GVL-1S     1.00
CLIP-GRU   0.80
VITA       1.00

Success in distinguishing expert from scripted robot demonstrations, measured by average BinVOC across 5 in-distribution scripted datasets.

Reward Shaping in Offline RL

Method     IQM     95% CI
VLM-CL     0.760   [0.722, 0.791]
VLM-RM     0.746   [0.718, 0.771]
CLIP-FT    0.785   [0.759, 0.809]
CLIP-GRU   0.777   [0.734, 0.814]
META-WL    0.779   [0.750, 0.804]
VITA       0.815   [0.785, 0.838]

Offline RL performance on the Meta-World MT10 benchmark measured by interquartile mean (IQM) with 95% stratified bootstrap CIs.
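The interquartile mean discards the bottom and top quartiles of scores and averages the rest, giving a robust aggregate across runs. A minimal sketch of IQM with a plain percentile bootstrap; note the paper reports *stratified* bootstrap CIs (as in rliable), whereas this simplified version resamples all scores jointly:

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: mean of the middle 50% of scores
    (bottom and top quartiles discarded)."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    n = len(s)
    return float(s[n // 4 : n - n // 4].mean())

def bootstrap_ci(scores, stat=iqm, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for a statistic -- an unstratified
    simplification of the stratified bootstrap used in the paper."""
    rng = np.random.default_rng(seed)
    s = np.asarray(scores, dtype=float).ravel()
    boots = [stat(rng.choice(s, size=len(s), replace=True))
             for _ in range(n_boot)]
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```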

BibTeX

@inproceedings{ziakas2026vita,
  title     = {VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision–Language Models},
  author    = {Ziakas, Christos and Russo, Alessandra},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
}