Vision–Language Models (VLMs) show promise as zero-shot goal-conditioned
value functions, but their frozen pre-trained representations limit generalization
and temporal reasoning. We introduce VITA, a zero-shot value function learning
method that enhances both capabilities via test-time adaptation. At inference, a
lightweight adaptation module is updated via a gradient step on a meta-learned
self-supervised loss, such that each test-time update improves value estimation. By
updating sequentially over a trajectory, VITA encodes history into its parameters,
addressing the temporal reasoning limitation. To mitigate shortcut learning, we
propose a dissimilarity-based sampling strategy that selects semantically diverse
segments of the trajectory during training.
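As a minimal sketch of the test-time update (illustrative notation; the symbols $\phi$, $\alpha$, and $\mathcal{L}_{\text{self}}$ are our labels, not necessarily the paper's):
\[
\phi_t \;=\; \phi_{t-1} \;-\; \alpha\,\nabla_{\phi}\,\mathcal{L}_{\text{self}}\!\left(o_t;\,\phi_{t-1}\right),
\qquad
\hat{V}_t \;=\; V\!\left(o_t, g;\,\theta,\,\phi_t\right),
\]
where $\phi_t$ denotes the adaptation-module parameters after observing $o_t$, $\theta$ the frozen VLM backbone, and $g$ the goal; $\mathcal{L}_{\text{self}}$ is meta-learned so that this single gradient step improves the value estimate $\hat{V}_t$, and since each $\phi_t$ depends on $\phi_{t-1}$, the sequential updates accumulate trajectory history in the parameters.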
In real-world robotic manipulation tasks,
VITA generalizes from a single training environment to diverse out-of-distribution
tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot
method based on autoregressive VLMs. In addition, we demonstrate that VITA's
zero-shot value estimates can be used for reward shaping in offline reinforcement
learning, yielding multi-task policies on the Meta-World benchmark that outperform
those trained with the simulator's fuzzy-logic dense rewards.
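One way such value estimates can serve as shaping rewards is the standard potential-based form (stated here as an illustrative assumption; the paper's exact scheme may differ), with the zero-shot estimate $\hat{V}$ as the potential:
\[
\tilde{r}_t \;=\; r_t \;+\; \gamma\,\hat{V}(s_{t+1}, g) \;-\; \hat{V}(s_t, g),
\]
where $\gamma$ is the discount factor; this densifies sparse task rewards while, by the potential-based shaping guarantee, leaving the optimal policy unchanged.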