Policy Gradient Guidance Enables Test‑Time Control

Adding Classifier-Free Guidance to PPO for Test-Time Control

Jianing Qi1 · Hao Tang2 · Zhigang Zhu3

1CUNY Graduate Center   2BMCC, CUNY   3CCNY, CUNY

PGG teaser figure
Experiments on discrete control tasks show that PGG can improve performance, especially in low-data regimes.

TL;DR

Summary

We introduce Policy Gradient Guidance (PGG), a simple extension of classifier-free guidance (CFG) from diffusion models to classical policy gradient methods. CFG is usually tied to the diffusion model architecture, but we show that PPO can be adapted to it by viewing the guided policy, from a score-function perspective, as a weighted policy mixture. PGG adds an unconditional policy branch πu(a) alongside the standard conditional policy πc(a|s), then forms a guided policy π̂(a|s) ∝ πu(a)^{1−γ} · πc(a|s)^{γ} controlled by a single hyperparameter γ; in the score-function view this matches the CFG combination rule. The result is a test-time control knob that modulates behavior without retraining. On discrete control tasks and continuous control (MuJoCo), PGG with moderate γ values improves sample efficiency and final performance, and it requires no architecture changes to existing PPO implementations.
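
A minimal sketch of the guided policy for discrete actions, assuming a PyTorch PPO implementation; the names (`cond_logits`, `uncond_logits`, `gamma`) are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def guided_log_probs(cond_logits, uncond_logits, gamma):
    """Combine conditional and unconditional branches as in PGG:
    log pi_hat(a|s) = (1 - gamma) * log pi_u(a) + gamma * log pi_c(a|s), then renormalize."""
    log_pc = F.log_softmax(cond_logits, dim=-1)    # log pi_c(a|s), state-conditioned
    log_pu = F.log_softmax(uncond_logits, dim=-1)  # log pi_u(a), state-independent
    mixed = (1.0 - gamma) * log_pu + gamma * log_pc
    return F.log_softmax(mixed, dim=-1)            # renormalize so probabilities sum to 1

# Usage: sample actions from the guided policy at test time.
cond_logits = torch.randn(4, 6)    # batch of state-conditioned logits (4 states, 6 actions)
uncond_logits = torch.randn(1, 6)  # unconditional branch, broadcast over the batch
gamma = 1.5                        # gamma = 1 recovers the plain conditional policy
dist = torch.distributions.Categorical(logits=guided_log_probs(cond_logits, uncond_logits, gamma))
actions = dist.sample()
```

Since γ only enters when the two branches are mixed, it can be changed at evaluation time without retraining, which is the test-time control knob described above.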