SAC and off-policy methods
When to reach beyond PPO. SAC's maximum-entropy framework, replay buffers, and the algorithms that win on real-world hardware where every minute of robot time matters.
PPO is on-policy: every batch of training data must come from the current policy, and each batch is thrown away after one or a few updates. Not great when robot time is expensive. SAC (Soft Actor-Critic, Haarnoja et al. 2018) is off-policy: it reuses every transition from a replay buffer. Sample-efficient, harder to stabilize, and the algorithm of choice when training time on real hardware is the bottleneck.
What "off-policy" buys you
On-policy: collect a batch, train on it, throw it away, collect again. PPO needs ~10⁸ environment steps to learn a quadruped gait.
Off-policy: collect data, save it to a buffer, sample randomly from the buffer, train, repeat. SAC reaches similar performance with ~10⁶ steps, 100× fewer.
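A minimal sketch of that off-policy loop, assuming a Gymnasium-style environment; `env`, `agent.act`, and `agent.update` are hypothetical stand-ins, not a specific library's API:

```python
import random
from collections import deque

total_steps, batch_size = 1_000_000, 256
buffer = deque(maxlen=1_000_000)                 # the replay buffer
obs, _ = env.reset()                             # `env` and `agent` are hypothetical stand-ins
for step in range(total_steps):
    action = agent.act(obs)
    next_obs, reward, terminated, truncated, _ = env.step(action)
    buffer.append((obs, action, reward, next_obs, terminated))
    obs = env.reset()[0] if (terminated or truncated) else next_obs
    if len(buffer) >= batch_size:
        agent.update(random.sample(buffer, batch_size))   # old transitions keep getting reused
```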
That ratio matters when:
- Each step costs wall-clock time on real hardware. At PPO's sample counts you're looking at hundreds of robot-hours; at SAC's, a handful.
- You're constrained to a few thousand demos for safety reasons.
- You're doing real-world RL where resets are expensive.
The SAC architecture
Three core networks, plus target copies:
- Two Q-functions Q₁(s, a), Q₂(s, a). Trained against the Bellman target. Take the minimum to combat overestimation.
- Actor (policy) π(a | s). Trained to maximize Q plus an entropy bonus (equivalently, Q minus α log π).
- Target networks for the two Q-functions: slow EMA copies used in the Bellman target, for stability. (Sketched below.)
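A minimal PyTorch sketch of those networks; the hidden sizes and the 17-D observation / 6-D action dimensions are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class QFunction(nn.Module):
    """Q(s, a): state and action in, scalar value out."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim + act_dim, 1)
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

class Actor(nn.Module):
    """Squashed Gaussian policy: outputs mean and log-std, samples, then applies tanh."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim, 2 * act_dim)
    def forward(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        std = log_std.clamp(-5, 2).exp()
        dist = torch.distributions.Normal(mean, std)
        pre_tanh = dist.rsample()                      # reparameterized sample
        action = torch.tanh(pre_tanh)
        # log-prob with the tanh change-of-variables correction
        log_prob = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1)

q1, q2 = QFunction(17, 6), QFunction(17, 6)                 # two Q-functions
q1_target, q2_target = QFunction(17, 6), QFunction(17, 6)   # slow EMA copies
actor = Actor(17, 6)
```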
The "soft" in Soft Actor-Critic: the objective adds an entropy bonus to the standard return:
J(π) = E[Σ_t r_t + α H(π(·|s_t))]
The entropy term α H(π) encourages diverse actions. Larger α → more exploration; smaller α → more exploitation.
Practical bonus: SAC can auto-tune α by adjusting it so the policy's entropy tracks a target value. This eliminates one major hyperparameter.
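A hedged sketch of that auto-tuning update, assuming `log_prob` comes from an actor like the one above and the common heuristic target entropy of −|A|; this mirrors the usual formulation, not any one library's exact code:

```python
import torch

act_dim = 6
target_entropy = -float(act_dim)                 # common heuristic: −|A|
log_alpha = torch.zeros(1, requires_grad=True)   # optimize log α so α stays positive
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob):
    # If the policy's entropy (−log_prob) drops below target_entropy, this loss pushes α up,
    # which in turn pushes the actor toward more exploration on the next policy update.
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()
```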
Why two Q-functions
Q-learning is biased upward — the max over actions in the Bellman target tends to overestimate. Single-Q methods (DQN, DDPG) are notoriously unstable. SAC takes the minimum of two independently trained Q-functions; both overestimate, the minimum is closer to truth.
This trick (clipped double Q-learning) is also in TD3, the deterministic-policy cousin of SAC.
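A sketch of the clipped Bellman target, reusing the hypothetical actor and target Q-networks from the architecture sketch above:

```python
import torch

gamma = 0.99

@torch.no_grad()
def bellman_target(reward, next_obs, done, alpha):
    next_action, next_log_prob = actor(next_obs)
    q_next = torch.min(q1_target(next_obs, next_action),
                       q2_target(next_obs, next_action))     # clipped double Q
    # soft value: Q minus α·log π, the entropy bonus baked into the target
    return reward + gamma * (1.0 - done) * (q_next - alpha * next_log_prob)
```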
The replay buffer
A circular buffer of recent transitions (s, a, r, s', done). Capacity 1e5 to 1e6. Sample uniformly each update.
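A minimal circular buffer with uniform sampling, as a sketch; the NumPy layout and shapes are illustrative assumptions:

```python
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity, obs_dim, act_dim):
        self.capacity, self.idx, self.full = capacity, 0, False
        self.obs      = np.zeros((capacity, obs_dim), np.float32)
        self.act      = np.zeros((capacity, act_dim), np.float32)
        self.rew      = np.zeros(capacity, np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), np.float32)
        self.done     = np.zeros(capacity, np.float32)

    def add(self, obs, act, rew, next_obs, done):
        i = self.idx
        self.obs[i], self.act[i], self.rew[i] = obs, act, rew
        self.next_obs[i], self.done[i] = next_obs, done
        self.idx = (self.idx + 1) % self.capacity     # overwrite the oldest transition when full
        self.full = self.full or self.idx == 0

    def sample(self, batch_size):
        n = self.capacity if self.full else self.idx
        j = np.random.randint(0, n, size=batch_size)  # uniform sampling
        return self.obs[j], self.act[j], self.rew[j], self.next_obs[j], self.done[j]
```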
Some variants:
- Prioritized experience replay (PER): sample more often from transitions with high TD error. Faster learning on hard problems; more code complexity.
- Hindsight experience replay (HER): relabel failed episodes with the goals they actually achieved. Makes sparse-reward tasks tractable and combines well with SAC (sketched below).
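A hedged sketch of the HER relabeling idea for a goal-conditioned task; the transition format and `reward_fn` are assumptions for illustration:

```python
def her_relabel(episode, reward_fn):
    """Relabel a failed episode: pretend the goal was the state we actually reached."""
    achieved_goal = episode[-1]["achieved_goal"]      # final achieved state
    relabeled = []
    for t in episode:
        new_t = dict(t)
        new_t["goal"] = achieved_goal
        # With the relabeled goal the sparse reward is often non-zero, so learning gets a signal.
        new_t["reward"] = reward_fn(t["achieved_goal"], achieved_goal)
        relabeled.append(new_t)
    return relabeled
```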
Hyperparameters that matter
| Parameter | Default | When to change |
|---|---|---|
| Learning rate | 3e-4 | Almost never |
| Buffer size | 1e6 | Smaller for short tasks; larger for hard exploration |
| Batch size | 256 | Bigger for visual SAC; smaller for tiny problems |
| Target entropy | −|A| | Raise it (less negative) if exploration is too low |
| Discount γ | 0.99 | Same as PPO |
| Tau (target net) | 0.005 | Almost never |
Six knobs vs PPO's eight. Less tuning anxiety.
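The same knobs as they appear in Stable-Baselines3's SAC constructor, with the table's defaults; the env name is a placeholder:

```python
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",               # placeholder; any continuous-control env works
    learning_rate=3e-4,          # almost never changed
    buffer_size=1_000_000,       # replay capacity
    batch_size=256,
    tau=0.005,                   # target-network EMA rate
    gamma=0.99,
    ent_coef="auto",             # auto-tune α ...
    target_entropy="auto",       # ... toward −|A| by default
)
```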
SAC vs PPO on robotics tasks
| Task | Winner | Why |
|---|---|---|
| Quadruped locomotion in sim | PPO | Massive parallelism; sample efficiency doesn't matter |
| Real-world arm RL | SAC | Robot time is precious; sample efficiency dominates |
| Drone agility | PPO | Fast sim, abundant compute |
| Manipulation with sparse reward | SAC + HER | HER only works with off-policy |
The recent variants
- TD3: deterministic-policy cousin. Same clipped-double-Q, no entropy term. Simpler; SAC usually wins.
- REDQ: an ensemble of Q-functions (~10) and a higher update-to-data ratio. Solves harder tasks faster.
- DroQ / TQC: dropout / quantile distributional Q-functions. Improvements on hard control tasks.
- HIL-SERL: SAC + human-in-the-loop demonstrations. Production real-world RL. Covered in the next lesson.
Common SAC failure modes
- Q-function blows up: extreme overestimation. Lower the learning rate and slow the target-network updates (lower tau).
- Policy collapses: alpha auto-tuning with too-small target entropy. Set target entropy higher (less negative).
- Doesn't explore: alpha is too low; target entropy too negative. Watch the entropy curve.
- Replay buffer overwrites useful data: buffer too small. Increase capacity.
The meta-question: when to use RL at all
RL is the right tool when:
- You have a clear reward function but no demonstrations.
- You need superhuman performance (no demonstrator can give you the answer).
- Your task has a closed-loop feedback structure that benefits from optimization.
It's the wrong tool when:
- You have demonstrations but no good reward function. Use imitation learning + ACT or diffusion policy.
- The task is brief and one-shot. Classical control + perception is faster.
- Safety is paramount and you can't afford training failures. RL is brittle; hand-engineered control is auditable.
Implementation
Stable-Baselines3's SAC implementation is the cleanest reference. CleanRL has a single-file version (~400 lines) for learning. For production, RSL-RL or RL Games.
The minimal SAC training loop:

```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env('Quadruped-v0', n_envs=8)   # any registered continuous-control env
model = SAC('MlpPolicy', env, learning_rate=3e-4, buffer_size=1_000_000)
model.learn(total_timesteps=1_000_000)
model.save('quadruped_sac')
```
That's a working SAC agent. Add monitoring, evaluation, hyperparameter tuning. The training loop itself is unchanged.
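For example, a quick evaluation pass with SB3's helper, reusing the `env` from the training snippet above:

```python
from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy

model = SAC.load("quadruped_sac")
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
print(f"mean reward {mean_reward:.1f} ± {std_reward:.1f}")
```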
Exercise
On the mujoco_playground PandaReach task, train both PPO and SAC. Plot success rate vs environment steps. SAC will hit 90% in ~50k steps; PPO needs ~500k. The 10× sample efficiency is what off-policy buys you. On hardware, that's 10× less robot wear.
Next
Real-world RL — when sample efficiency isn't enough; you need safety, resets, and demonstrations interleaved with online learning. HIL-SERL and the modern recipes.