Policy Gradient Guidance Enables Test‑Time Control

Adding Classifier-Free Guidance to PPO for Test-Time Control

Jianing Qi1 · Hao Tang2 · Zhigang Zhu3

1CUNY Graduate Center   2BMCC, CUNY   3CCNY, CUNY

PGG teaser figure
Experiments on discrete control tasks show that PGG can improve performance, especially in low-data regimes.

TL;DR

Summary

We introduce Policy Gradient Guidance (PGG), a simple extension of classifier-free guidance (CFG) from diffusion models to classical policy gradient methods. CFG is usually tied to the diffusion model architecture, but we show that PPO can be adapted to it by viewing the guided policy, from a score-function perspective, as a weighted policy mixture. PGG adds an unconditional policy branch πu(a) alongside the standard conditional policy πc(a|s), then forms a guided policy π̂(a|s) ∝ πu(a)^{1−γ} · πc(a|s)^{γ} controlled by a single hyperparameter γ; in the score-function view this matches the CFG combination rule. The result is a test-time control knob that modulates behavior without retraining. On discrete control tasks and continuous control (MuJoCo), PGG with moderate γ values improves sample efficiency and final performance, and it requires no architecture changes to existing PPO implementations.
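
A minimal sketch of the guided policy for discrete actions, assuming a PyTorch PPO implementation; the names (`cond_logits`, `uncond_logits`, `gamma`) are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def guided_log_probs(cond_logits, uncond_logits, gamma):
    """Combine conditional and unconditional branches as in PGG:
    log pi_hat(a|s) = (1 - gamma) * log pi_u(a) + gamma * log pi_c(a|s), then renormalize."""
    log_pc = F.log_softmax(cond_logits, dim=-1)    # log pi_c(a|s), state-conditioned
    log_pu = F.log_softmax(uncond_logits, dim=-1)  # log pi_u(a), state-independent
    mixed = (1.0 - gamma) * log_pu + gamma * log_pc
    return F.log_softmax(mixed, dim=-1)            # renormalize so probabilities sum to 1

# Usage: sample actions from the guided policy at test time.
cond_logits = torch.randn(4, 6)    # batch of state-conditioned logits (4 states, 6 actions)
uncond_logits = torch.randn(1, 6)  # unconditional branch, broadcast over the batch
gamma = 1.5                        # gamma = 1 recovers the plain conditional policy
dist = torch.distributions.Categorical(logits=guided_log_probs(cond_logits, uncond_logits, gamma))
actions = dist.sample()
```

Since γ only enters when the two branches are mixed, it can be changed at evaluation time without retraining, which is the test-time control knob described above.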