Aligning Flow Map Policies with Optimal Q-Guidance

¹Imperial College London, ²Mila
[Figure: FMQ method overview]

We introduce flow map policies: one-step generative policies that learn the jump operator of the probability flow ODE. We derive FMQ, a closed-form optimal learning target under a critic-guided trust-region constraint, and QGBS, a stochastic sampler combining renoising with beam search for inference-time refinement.

Abstract

Generative policies based on expressive model classes, such as diffusion and flow matching, are well-suited to complex control problems with highly multimodal action distributions. Their expressivity, however, comes at a significant inference cost: generating each action typically requires simulating many steps of the generative process, compounding latency across sequential decision-making rollouts. We introduce flow map policies, a novel class of generative policies designed for fast action generation by learning to take arbitrary-size jumps—including one-step jumps—across the generative dynamics of existing flow-based policies. We instantiate flow map policies for offline-to-online reinforcement learning (RL) and formulate online adaptation as a trust-region optimization problem that improves the critic's Q-value while remaining close to the offline policy. We theoretically derive Flow Map Q-Guidance (FMQ), a principled closed-form learning target that is optimal for adapting offline flow map policies under a critic-guided trust-region constraint. We further introduce Q-Guided Beam Search (QGBS), a stochastic flow-map sampler that combines renoising with beam search to enable iterative inference-time refinement. Across 12 challenging robotic manipulation and locomotion tasks from OGBench and RoboMimic, FMQ achieves state-of-the-art performance in offline-to-online RL, outperforming the previous one-step policy MVP by a relative improvement of 21.3% on the average success rate.

Key Contributions

  • Flow map policies. We introduce flow map policies as a framework for learning one-step policies as two-time jump operators for flow-based generative actors.
  • Algorithms. We introduce FMQ, which efficiently adapts flow map actors using optimal \(Q\)-guidance (Theorem 3.2). We further introduce QGBS, a stochastic inference-time refinement algorithm that combines flow map renoising, beam selection, and trust-region \(Q\)-guidance.
  • State-of-the-art performance. Across 12 manipulation and locomotion tasks from OGBench and RoboMimic, FMQ outperforms the prior one-step policy MVP by a relative 21.3% in average success rate, while reaching MVP's best success rate ≈2.77× faster on average during online adaptation.

Method

[Figure 1 panels: FMQ trust-region projection (left); QGBS renoising beam search (right)]

Figure 1. (Left) FMQ: one-step flow map policy transports noise \(a_0\) to action \(a_1\); then, trust-region projection displaces action \(a_1\) to \(a_1^*\) that maximizes \(Q\)-value. (Right) QGBS (\(M{=}1, B{=}2\)): renoising corrupts \(a_1^*\) into \(B\) intermediate states \(a_{t'}\), which the flow map policy then denoises to generate \(B\) candidate actions; candidates are updated via the optimal trust-region displacement to maximize \(Q_\phi\), and the highest-valued \(M\) actions are selected.

Flow Map Policy

We introduce flow map policies, a novel class of generative policies that learn the unique two-time jump operator [1] associated with the probability flow ODE of diffusion and flow-matching policies.

Definition (Flow-Map Policy)

Let \(X_{r,t}: [0,1]^2 \times \mathcal{S} \times \mathbb{R}^d \to \mathbb{R}^d\) be a flow map that evolves the action dynamics between any two times \((r,t) \in [0,1]^2\), conditioned on the MDP state \(s \in \mathcal{S}\) and satisfying the jump condition \(X_{r,t}(a_r|s) = a_t\). The flow-map policy is then the distribution induced by this map evaluated at time \(t=1\):

\[ \pi(a \mid s) = [X_{r,1}]_{\#}\, p_r(a_r \mid s) \]
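
As a concrete illustration of the jump condition, the sketch below uses a toy case that is not taken from the paper: straight-line paths \(a_t = (1-t)\,a_0 + t\,\mu\) toward a single target action \(\mu\), for which the flow map has a closed form. The names `toy_flow_map` and `mu` are hypothetical and chosen only for this example.

```python
import random

def toy_flow_map(r, t, a_r, mu):
    """Closed-form flow map for the toy path a_t = (1 - t) * a_0 + t * mu.

    Assumption (not from the paper): the target is a point mass at `mu`,
    so X_{r,t}(a_r) = a_r + (t - r) / (1 - r) * (mu - a_r) for r < 1.
    """
    scale = (t - r) / (1.0 - r)
    return [x + scale * (m - x) for x, m in zip(a_r, mu)]

random.seed(0)
mu = [0.5, -1.0]                            # hypothetical target action
a_0 = [random.gauss(0.0, 1.0) for _ in mu]  # noise sample a_0 ~ N(0, I)

# One-step jump (r = 0, t = 1): a single function evaluation yields the action.
a_1 = toy_flow_map(0.0, 1.0, a_0, mu)

# Two smaller jumps land on the same point:
# X_{0.4,1}(X_{0,0.4}(a_0)) = X_{0,1}(a_0).
a_mid = toy_flow_map(0.0, 0.4, a_0, mu)
a_1_two_step = toy_flow_map(0.4, 1.0, a_mid, mu)
```

In this toy case the one-step and two-step results coincide, which is the composition property that lets a flow map take jumps of arbitrary size.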

FMQ: Trust-Region Online Adaptation

For principled online adaptation of one-step flow map policies, we formulate a trust-region optimization problem and derive an analytically optimal, closed-form method that aligns the action distribution with \(Q\)-value guidance.

Theorem 3.2 (Optimal Trust-Region Displacement)

Consider a flow map policy \(\pi^{\text{ref}}(\cdot|s)\) with underlying flow map \(X^{\text{ref}}_{r,1}\), generating actions \(a_1 = a_r + (1-r)\,u^{\text{ref}}_{r,1}(a_r \mid s)\). The optimal average velocity \(u^*_{r,1}\) that maximizes the first-order expansion of \(Q_\phi\) around \(a_1\), subject to trust-region constraint \(\|u_{r,1} - u^{\text{ref}}_{r,1}\|_2 \le \eta\), is:

\[ u^*_{r,1}(a_r \mid s) = u^{\text{ref}}_{r,1}(a_r \mid s) + \eta\, \frac{\nabla_a Q_\phi(s, a_1)}{\|\nabla_a Q_\phi(s, a_1)\|_2} \]
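
A brief sketch of why the normalized gradient appears, assuming \(r < 1\) and \(\nabla_a Q_\phi(s, a_1) \neq 0\); the shorthand \(\delta = u_{r,1} - u^{\text{ref}}_{r,1}\) is introduced here only for illustration. Perturbing the average velocity by \(\delta\) moves the action by \((1-r)\,\delta\), so to first order

\[ Q_\phi\big(s, a_1 + (1-r)\,\delta\big) \approx Q_\phi(s, a_1) + (1-r)\, \nabla_a Q_\phi(s, a_1)^\top \delta . \]

By the Cauchy-Schwarz inequality, the linear term is bounded over the trust region \(\|\delta\|_2 \le \eta\):

\[ \nabla_a Q_\phi(s, a_1)^\top \delta \;\le\; \|\nabla_a Q_\phi(s, a_1)\|_2\, \|\delta\|_2 \;\le\; \eta\, \|\nabla_a Q_\phi(s, a_1)\|_2 , \]

with equality exactly when \(\delta = \eta\, \nabla_a Q_\phi(s, a_1) / \|\nabla_a Q_\phi(s, a_1)\|_2\), which is the displacement stated in the theorem.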

Algorithm 1: Flow Map \(Q\)-Guidance (FMQ)
Require: Offline policy \(u^{\text{off}}_{r,1}\), online policy \(u^\theta_{r,1}\), critics \(Q_{\phi_1}, Q_{\phi_2}\), buffer \(\mathcal{D}\)
for each environment step do
    \(a_1 \gets a_0 + u^\theta_{0,1}(a_0|s)\),   \(a_0 \sim \mathcal{N}(0,I)\)
    \(\mathcal{D} \gets \mathcal{D} \cup \{(s, a_1, r, s')\}\) ▹ here \(r\) is the environment reward
    Sample batch from \(\mathcal{D}\); update critics
    \(r \sim \mathcal{U}[0,1)\);   \(a_0 \sim \mathcal{N}(0,I)\);   \(a_r \gets (1{-}r)a_0 + r\,a_{\text{data}}\) ▹ from here on \(r\) is the flow time
    \(a_1 \gets a_r + (1{-}r)\,u^{\text{off}}_{r,1}(a_r|s)\)
    \(g \gets \nabla_a Q_{\phi_1}(s,a_1) \;/\; (\|\nabla_a Q_{\phi_1}(s,a_1)\|_2 + \kappa_1)\)
    \(\eta_{\text{eff}} \gets \eta \;/\; (1 + \beta\,\tilde{\delta}_{\text{critic}})\)
    \(\theta \gets \theta - \alpha\,\nabla_\theta\|u^\theta_{r,1}(a_r|s) - \mathrm{sg}(u^{\text{off}}_{r,1}(a_r|s) + \eta_{\text{eff}}\,g)\|^2\)
end for
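
The actor update in Algorithm 1 regresses onto a fixed target. A minimal sketch of that target computation follows, using plain Python lists. The arguments `kappa`, `beta`, and `delta_critic` stand in for \(\kappa_1\), \(\beta\), and \(\tilde{\delta}_{\text{critic}}\), whose definitions are not given on this page, so the values used here are placeholders rather than the authors' settings.

```python
import math

def fmq_target(u_off, grad_q, eta, kappa=1e-8, beta=0.0, delta_critic=0.0):
    """Regression target u_off + eta_eff * g from Algorithm 1.

    u_off:  offline average velocity u^off_{r,1}(a_r | s), a list of floats
    grad_q: critic gradient grad_a Q(s, a_1) at a_1 = a_r + (1 - r) * u_off
    kappa, beta, delta_critic: placeholder values (assumptions, see lead-in)
    """
    norm = math.sqrt(sum(g * g for g in grad_q))
    g_hat = [g / (norm + kappa) for g in grad_q]   # normalized ascent direction
    eta_eff = eta / (1.0 + beta * delta_critic)    # effective step size
    return [u + eta_eff * g for u, g in zip(u_off, g_hat)]

def squared_error(u_theta, target):
    """Actor loss ||u_theta - sg(target)||^2; the target is a constant here."""
    return sum((u - t) ** 2 for u, t in zip(u_theta, target))

# Hypothetical numbers for a 2-D action.
u_off = [0.2, -0.4]
grad_q = [3.0, 4.0]          # norm 5, so the unit direction is [0.6, 0.8]
target = fmq_target(u_off, grad_q, eta=0.1)
loss_at_offline = squared_error(u_off, target)   # approximately eta ** 2
```

With `beta = 0` the target sits at distance \(\eta\) from the offline velocity, so an online policy that still equals the offline one incurs a loss of roughly \(\eta^2\).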

QGBS: Inference-Time Refinement

We additionally introduce Q-Guided Beam Search (QGBS), a complementary inference-time refinement procedure that iteratively improves actions via two steps:

  1. Exploration: Re-noise the \(M\) current candidates to an intermediate time \(t' = \rho/(1+\rho)\) and generate \(B\) completions of each via the flow map, producing \(M \cdot B\) diverse candidate actions.
  2. Exploitation: Score all \(M \cdot B\) candidates with \(Q_\phi\), keep the top-\(M\), and apply the trust-region projection from Theorem 3.2.

Algorithm 2: \(Q\)-Guided Beam Search (QGBS)
Require: Flow map \(X^\theta_{r,1}\), critic \(Q_\phi\), state \(s\), beam \(M\), steps \(K\), branches \(B\), SNR \(\rho\), step size \(\eta\)
\(t' \gets \rho/(1{+}\rho)\)
Sample \(\{a_0^m\}_{m=1}^M \sim \mathcal{N}(0,I)\);   \(a_1^m \gets a_0^m + u^\theta_{0,1}(a_0^m|s)\) for all \(m\)
for \(k = 1, \ldots, K\) do
    for \(m = 1, \ldots, M\) and \(b = 1, \ldots, B\) do
        \(\varepsilon^{mb} \sim \mathcal{N}(0,I)\)
        \(\hat{a}_1^{mb} \gets X^\theta_{t',1}\!\left(t'\,a_1^m + (1{-}t')\,\varepsilon^{mb} \mid s\right)\) ▹ Re-noise & complete
    end for
    \(\{a_1^m\}_{m=1}^M \gets \mathrm{Top\text{-}}M\!\left(\{\hat{a}_1^{mb}\}_{m,b};\; Q_\phi(s, \hat{a}_1^{mb})\right)\) ▹ Select best \(M\) of \(M{\cdot}B\)
    \(a_1^m \gets a_1^m + \eta\,\nabla_a Q_\phi(s, a_1^m) / \|\nabla_a Q_\phi(s, a_1^m)\|_2\) for all \(m\) ▹ Thm. 3.2
end for
return \(a_1^{\arg\max_m Q_\phi(s, a_1^m)}\)
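
The loop in Algorithm 2 can be sketched in a few lines of plain Python. Everything in the toy setup is an assumption made for illustration: the flow map is the exact map between \(\mathcal{N}(0, I)\) noise and a Gaussian action distribution \(\mathcal{N}(\mu, \sigma^2 I)\) (standing in for a learned \(X^\theta_{t,1}\)), the critic is a hypothetical quadratic, and the zero-gradient guard is an addition not present in the pseudocode.

```python
import math
import random

def qgbs(flow_map, q, grad_q, dim, M, K, B, rho, eta, rng):
    """Sketch of Algorithm 2; flow_map(t, x) returns X_{t,1}(x | s) for a fixed s."""
    t_prime = rho / (1.0 + rho)
    # Initial beam: M one-step samples from pure noise (t = 0).
    beam = [flow_map(0.0, [rng.gauss(0.0, 1.0) for _ in range(dim)])
            for _ in range(M)]
    for k in range(K):
        candidates = []
        for a in beam:
            for b in range(B):
                eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
                # Re-noise to time t', then complete with one flow-map call.
                noisy = [t_prime * x + (1.0 - t_prime) * e for x, e in zip(a, eps)]
                candidates.append(flow_map(t_prime, noisy))
        # Keep the M candidates with the highest critic value.
        beam = sorted(candidates, key=q, reverse=True)[:M]
        # Step of length eta along the normalized critic gradient (Thm. 3.2).
        stepped = []
        for a in beam:
            g = grad_q(a)
            norm = math.sqrt(sum(x * x for x in g))
            # Guard against a zero gradient (an addition for safety).
            stepped.append([x + eta * gx / norm for x, gx in zip(a, g)]
                           if norm > 0 else a)
        beam = stepped
    return max(beam, key=q)

# --- Toy setup (all values hypothetical) ---
MU, SIGMA, A_STAR = [0.0, 0.0], 0.5, [0.3, -0.2]
calls = {"n": 0}  # counts flow-map evaluations

def toy_flow_map(t, x):
    """Exact X_{t,1} when a_0 ~ N(0, I) and a_1 ~ N(MU, SIGMA^2 I) independently."""
    calls["n"] += 1
    s_t = math.sqrt((1.0 - t) ** 2 + (t * SIGMA) ** 2)
    return [m + (SIGMA / s_t) * (xi - t * m) for xi, m in zip(x, MU)]

def toy_q(a):
    return -sum((x - c) ** 2 for x, c in zip(a, A_STAR))

def toy_grad_q(a):
    return [-2.0 * (x - c) for x, c in zip(a, A_STAR)]

best = qgbs(toy_flow_map, toy_q, toy_grad_q, dim=2,
            M=4, K=1, B=4, rho=1.0, eta=0.05, rng=random.Random(0))
```

Setting `K=0` skips the refinement loop entirely, so the function reduces to picking the best of `M` one-step samples.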

Main Results

Success rate (mean ± std over 5 seeds, 50 episodes). IQM with 95% CIs.

Environment  | QC [2]            | MVP [3]           | MVP + QGBS (Ours) | FMQ (Ours)        | FMQ + QGBS (Ours)
can          | 0.88 ± 0.06       | 0.83 ± 0.07       | 0.87 ± 0.07       | 0.96 ± 0.04       | 0.97 ± 0.03
square       | 0.89 ± 0.04       | 0.82 ± 0.04       | 0.83 ± 0.05       | 0.94 ± 0.02       | 0.95 ± 0.04
cube-dbl-t3  | 1.00 ± 0.00       | 1.00 ± 0.00       | 1.00 ± 0.00       | 1.00 ± 0.00       | 1.00 ± 0.00
cube-dbl-t4  | 0.92 ± 0.05       | 0.98 ± 0.02       | 0.98 ± 0.02       | 0.98 ± 0.02       | 1.00 ± 0.00
cube-trl-t3  | 0.83 ± 0.08       | 0.64 ± 0.12       | 0.78 ± 0.12       | 0.78 ± 0.10       | 0.84 ± 0.04
cube-trl-t4  | 0.37 ± 0.26       | 0.32 ± 0.07       | 0.37 ± 0.09       | 0.88 ± 0.07       | 0.87 ± 0.05
scene-t4     | 0.99 ± 0.01       | 0.92 ± 0.02       | 0.98 ± 0.02       | 1.00 ± 0.00       | 0.99 ± 0.01
scene-t5     | 0.96 ± 0.02       | 0.90 ± 0.06       | 0.95 ± 0.05       | 0.98 ± 0.02       | 1.00 ± 0.00
hmaze-med-t3 | 0.65 ± 0.11       | 0.47 ± 0.10       | 0.53 ± 0.03       | 0.69 ± 0.04       | 0.58 ± 0.07
hmaze-med-t4 | 0.04 ± 0.03       | 0.00 ± 0.00       | 0.02 ± 0.02       | 0.06 ± 0.03       | 0.06 ± 0.03
amaze-gnt-t4 | 0.64 ± 0.12       | 0.42 ± 0.06       | 0.43 ± 0.04       | 0.80 ± 0.06       | 0.77 ± 0.03
amaze-gnt-t5 | 0.91 ± 0.05       | 0.82 ± 0.08       | 0.90 ± 0.06       | 0.92 ± 0.04       | 0.92 ± 0.05
IQM [95% CI] | 0.86 [0.84, 0.87] | 0.75 [0.73, 0.77] | 0.81 [0.78, 0.83] | 0.91 [0.89, 0.93] | 0.93 [0.91, 0.94]
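
The headline 21.3% figure appears consistent with the IQM row above when read as the relative gain of FMQ over MVP. This reading is based on the rounded table values and is an interpretation, not a statement from the authors; the gain relative to QC is shown for comparison:

\[ \frac{0.91 - 0.75}{0.75} \approx 0.213, \qquad \frac{0.91 - 0.86}{0.86} \approx 0.058 . \]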

Training Curves

Figure 2. Offline-to-online learning curves for QC, MVP, and FMQ. All methods train for 1M offline steps followed by 1M online steps. Shaded regions indicate 95% CIs over 5 seeds.

Convergence Speedup

FMQ reaches the highest success rate achievable by MVP 2.77× faster on average during the online phase, and up to 6.14× on humanoidmaze-medium-t3. The Q-gradient alignment provides a stronger learning signal than best-of-N selection, leading to faster policy improvement per environment step.

Figure 3. Convergence speedup of FMQ compared to MVP at success targets (ξ), with 95% CIs.

QGBS Efficiency

The computational cost of QGBS is NFE = M(1 + K·B) per action selection, where M is the number of initial candidates (the beam width), K is the number of renoising steps, and B is the number of completions per candidate. Best-of-N sampling corresponds to K=0 and M=N. The best configuration tested (K=1, B=4, M=4) achieves a peak IQM of 0.93 with only 20 NFE, 37.5% fewer than best-of-32 (IQM 0.91), suggesting that diversifying candidates through renoising is more efficient than simply drawing more candidates. Increasing K beyond 1 does not improve performance in these runs, indicating that a single renoising step is enough to reach the best results.

K | {B, M}  | NFE | IQM
0 | {1, 32} | 32  | 0.91
1 | {4, 4}  | 20  | 0.93
1 | {2, 8}  | 24  | 0.93
1 | {1, 16} | 32  | 0.92
2 | {4, 4}  | 36  | 0.91
2 | {4, 16} | 144 | 0.90

Optimal: K=1, B=4, M=4 achieves peak IQM with only 20 NFE (37.5% fewer than best-of-32).
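
The NFE column can be reproduced directly from the cost formula. The short check below assumes NFE counts flow-map evaluations only (M initial samples plus K rounds of M·B completions); the helper name `qgbs_nfe` is hypothetical.

```python
def qgbs_nfe(M, K, B):
    """Evaluations per action: M initial samples plus K rounds of M * B completions."""
    return M * (1 + K * B)

# (K, B, M) rows from the table above.
configs = [(0, 1, 32), (1, 4, 4), (1, 2, 8), (1, 1, 16), (2, 4, 4), (2, 4, 16)]
nfes = [qgbs_nfe(M, K, B) for K, B, M in configs]

# Relative saving of the (K=1, B=4, M=4) setting over best-of-32.
best_of_32 = qgbs_nfe(M=32, K=0, B=1)
saving = 1 - qgbs_nfe(M=4, K=1, B=4) / best_of_32
```

Here `nfes` matches the NFE column row by row, and `saving` evaluates to 0.375, the 37.5% reduction quoted in the text.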

Task Rollouts

Visualizations of FMQ policy rollouts across all 12 evaluation environments.

can
square
cube-double-t3
cube-double-t4
cube-triple-t3
cube-triple-t4
scene-t4
scene-t5
hmaze-med-t3
hmaze-med-t4
amaze-gnt-t4
amaze-gnt-t5

References

[1] N. M. Boffi, M. S. Albergo, and E. Vanden-Eijnden. How to build a consistency model: Learning flow maps via self-distillation. 2025. arXiv:2505.18825
[2] Q. Li, Z. Zhou, and S. Levine. Reinforcement learning with action chunking. NeurIPS, 2025. arXiv:2507.07969
[3] G. Zhan, L. Tao, P. Wang, Y. Wang, Y. Li, Y. Chen, H. Li, M. Tomizuka, and S. E. Li. Mean flow policy with instantaneous velocity constraint for one-step action generation. ICLR, 2026. arXiv:2602.13810

BibTeX

@article{ziakas2026fmq,
  title={Aligning Flow Map Policies with Optimal Q-Guidance},
  author={Ziakas, Christos and Russo, Alessandra and Bose, Avishek Joey},
  journal={arXiv preprint arXiv:2605.12416},
  year={2026}
}