SAC and off-policy methods
When to reach beyond PPO. SAC's maximum-entropy framework, replay buffers, and the algorithms that win on real-world hardware where every minute of robot time matters.
PPO is on-policy: every batch of training data must come from the current policy, and each batch is thrown away after one or a few updates. Not great when robot time is expensive. SAC (Soft Actor-Critic, Haarnoja et al. 2018) is off-policy: it reuses every transition from a replay buffer. Sample-efficient, harder to stabilize, and the algorithm of choice when training time on real hardware is the bottleneck.
What "off-policy" buys you
On-policy: collect a batch, train on it, throw it away, collect again. PPO needs ~10⁸ environment steps to learn a quadruped gait.
Off-policy: collect data, save it to a buffer, sample randomly from the buffer, train, repeat. SAC reaches similar performance with ~10⁶ steps, 100× fewer.
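A minimal sketch of that off-policy loop, assuming a Gymnasium-style environment; `env`, `agent.act`, and `agent.update` are hypothetical stand-ins, not a specific library's API:

```python
import random
from collections import deque

total_steps, batch_size = 1_000_000, 256
buffer = deque(maxlen=1_000_000)                 # the replay buffer
obs, _ = env.reset()                             # `env` and `agent` are hypothetical stand-ins
for step in range(total_steps):
    action = agent.act(obs)
    next_obs, reward, terminated, truncated, _ = env.step(action)
    buffer.append((obs, action, reward, next_obs, terminated))
    obs = env.reset()[0] if (terminated or truncated) else next_obs
    if len(buffer) >= batch_size:
        agent.update(random.sample(buffer, batch_size))   # old transitions keep getting reused
```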
That ratio matters when:
- Each step costs wall-clock time on real hardware. At PPO's sample counts you're looking at hundreds of robot-hours; at SAC's, a handful.
- You're constrained to a few thousand demos for safety reasons.
- You're doing real-world RL where resets are expensive.
The SAC architecture
Three core networks, plus target copies:
- Two Q-functions Q₁(s, a), Q₂(s, a). Trained against the Bellman target. Take the minimum to combat overestimation.
- Actor (policy) π(a | s). Trained to maximize Q plus an entropy bonus (equivalently, Q minus α log π).
- Target networks for the two Q-functions: slow EMA copies used in the Bellman target, for stability. (Sketched below.)
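A minimal PyTorch sketch of those networks; the hidden sizes and the 17-D observation / 6-D action dimensions are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class QFunction(nn.Module):
    """Q(s, a): state and action in, scalar value out."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim + act_dim, 1)
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

class Actor(nn.Module):
    """Squashed Gaussian policy: outputs mean and log-std, samples, then applies tanh."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim, 2 * act_dim)
    def forward(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        std = log_std.clamp(-5, 2).exp()
        dist = torch.distributions.Normal(mean, std)
        pre_tanh = dist.rsample()                      # reparameterized sample
        action = torch.tanh(pre_tanh)
        # log-prob with the tanh change-of-variables correction
        log_prob = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1)

q1, q2 = QFunction(17, 6), QFunction(17, 6)                 # two Q-functions
q1_target, q2_target = QFunction(17, 6), QFunction(17, 6)   # slow EMA copies
actor = Actor(17, 6)
```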
The "soft" in Soft Actor-Critic: the objective adds an entropy bonus to the standard return:
J(π) = E[Σ_t r_t + α H(π(·|s_t))]
The entropy term α H(π) encourages diverse actions. Larger α → more exploration; smaller α → more exploitation.
Practical bonus: SAC can auto-tune α by adjusting it so the policy's entropy tracks a target value. This eliminates one major hyperparameter.
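A hedged sketch of that auto-tuning update, assuming `log_prob` comes from an actor like the one above and the common heuristic target entropy of −|A|; this mirrors the usual formulation, not any one library's exact code:

```python
import torch

act_dim = 6
target_entropy = -float(act_dim)                 # common heuristic: −|A|
log_alpha = torch.zeros(1, requires_grad=True)   # optimize log α so α stays positive
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob):
    # If the policy's entropy (−log_prob) drops below target_entropy, this loss pushes α up,
    # which in turn pushes the actor toward more exploration on the next policy update.
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()
```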
Why two Q-functions
Q-learning is biased upward — the max over actions in the Bellman target tends to overestimate. Single-Q methods (DQN, DDPG) are notoriously unstable. SAC takes the minimum of two independently trained Q-functions; both overestimate, the minimum is closer to truth.
This trick (clipped double Q-learning) is also in TD3, the deterministic-policy cousin of SAC.
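A sketch of the clipped Bellman target, reusing the hypothetical actor and target Q-networks from the architecture sketch above:

```python
import torch

gamma = 0.99

@torch.no_grad()
def bellman_target(reward, next_obs, done, alpha):
    next_action, next_log_prob = actor(next_obs)
    q_next = torch.min(q1_target(next_obs, next_action),
                       q2_target(next_obs, next_action))     # clipped double Q
    # soft value: Q minus α·log π, the entropy bonus baked into the target
    return reward + gamma * (1.0 - done) * (q_next - alpha * next_log_prob)
```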
The replay buffer
A circular buffer of recent transitions (s, a, r, s', done). Capacity 1e5 to 1e6. Sample uniformly each update.
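A minimal circular buffer with uniform sampling, as a sketch; the NumPy layout and shapes are illustrative assumptions:

```python
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity, obs_dim, act_dim):
        self.capacity, self.idx, self.full = capacity, 0, False
        self.obs      = np.zeros((capacity, obs_dim), np.float32)
        self.act      = np.zeros((capacity, act_dim), np.float32)
        self.rew      = np.zeros(capacity, np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), np.float32)
        self.done     = np.zeros(capacity, np.float32)

    def add(self, obs, act, rew, next_obs, done):
        i = self.idx
        self.obs[i], self.act[i], self.rew[i] = obs, act, rew
        self.next_obs[i], self.done[i] = next_obs, done
        self.idx = (self.idx + 1) % self.capacity     # overwrite the oldest transition when full
        self.full = self.full or self.idx == 0

    def sample(self, batch_size):
        n = self.capacity if self.full else self.idx
        j = np.random.randint(0, n, size=batch_size)  # uniform sampling
        return self.obs[j], self.act[j], self.rew[j], self.next_obs[j], self.done[j]
```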
Some variants:
- Prioritized experience replay (PER): sample more often from transitions with high TD error. Faster learning on hard problems; more code complexity.
- Hindsight experience replay (HER): relabel failed episodes with the goals they actually achieved. Makes sparse-reward tasks tractable and combines well with SAC (sketched below).
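A hedged sketch of the HER relabeling idea for a goal-conditioned task; the transition format and `reward_fn` are assumptions for illustration:

```python
def her_relabel(episode, reward_fn):
    """Relabel a failed episode: pretend the goal was the state we actually reached."""
    achieved_goal = episode[-1]["achieved_goal"]      # final achieved state
    relabeled = []
    for t in episode:
        new_t = dict(t)
        new_t["goal"] = achieved_goal
        # With the relabeled goal the sparse reward is often non-zero, so learning gets a signal.
        new_t["reward"] = reward_fn(t["achieved_goal"], achieved_goal)
        relabeled.append(new_t)
    return relabeled
```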
Hyperparameters that matter
| Parameter | Default | When to change |
|---|---|---|
| Learning rate | 3e-4 | Almost never |
| Buffer size | 1e6 | Smaller for short tasks; larger for hard exploration |
| Batch size | 256 | Bigger for visual SAC; smaller for tiny problems |
| Target entropy | −|A| | Raise it (less negative) if exploration is too low |
| Discount γ | 0.99 | Same as PPO |
| Tau (target net) | 0.005 | Almost never |
Six knobs vs PPO's eight. Less tuning anxiety.
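The same knobs as they appear in Stable-Baselines3's SAC constructor, with the table's defaults; the env name is a placeholder:

```python
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",               # placeholder; any continuous-control env works
    learning_rate=3e-4,          # almost never changed
    buffer_size=1_000_000,       # replay capacity
    batch_size=256,
    tau=0.005,                   # target-network EMA rate
    gamma=0.99,
    ent_coef="auto",             # auto-tune α ...
    target_entropy="auto",       # ... toward −|A| by default
)
```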
SAC vs PPO on robotics tasks
| Task | Winner | Why |
|---|---|---|
| Quadruped locomotion in sim | PPO | Massive parallelism; sample efficiency doesn't matter |
| Real-world arm RL | SAC | Robot time is precious; sample efficiency dominates |
| Drone agility | PPO | Fast sim, abundant compute |
| Manipulation with sparse reward | SAC + HER | HER only works with off-policy |
The recent variants
- TD3: deterministic-policy cousin. Same clipped-double-Q, no entropy term. Simpler; SAC usually wins.
- REDQ: an ensemble of Q-functions (~10) and a higher update-to-data ratio. Solves harder tasks faster.
- DroQ / TQC: dropout / quantile distributional Q-functions. Improvements on hard control tasks.
- HIL-SERL: SAC + human-in-the-loop demonstrations. Production real-world RL. Covered in the next lesson.
Common SAC failure modes
- Q-function blows up: extreme overestimation. Lower the learning rate and slow the target-network updates (lower tau).
- Policy collapses: alpha auto-tuning with too-small target entropy. Set target entropy higher (less negative).
- Doesn't explore: alpha is too low; target entropy too negative. Watch the entropy curve.
- Replay buffer overwrites useful data: buffer too small. Increase capacity.
The meta-question: when to use RL at all
RL is the right tool when:
- You have a clear reward function but no demonstrations.
- You need superhuman performance (no demonstrator can give you the answer).
- Your task has a closed-loop feedback structure that benefits from optimization.
It's the wrong tool when:
- You have demonstrations but no good reward function. Use imitation learning + ACT or diffusion policy.
- The task is brief and one-shot. Classical control + perception is faster.
- Safety is paramount and you can't afford training failures. RL is brittle; hand-engineered control is auditable.
Implementation
Stable-Baselines3's SAC implementation is the cleanest reference. CleanRL has a single-file version (~400 lines) for learning. For production, RSL-RL or RL Games.
The minimal SAC training loop:

```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env('Quadruped-v0', n_envs=8)   # any registered continuous-control env
model = SAC('MlpPolicy', env, learning_rate=3e-4, buffer_size=1_000_000)
model.learn(total_timesteps=1_000_000)
model.save('quadruped_sac')
```
That's a working SAC agent. Add monitoring, evaluation, hyperparameter tuning. The training loop itself is unchanged.
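For example, a quick evaluation pass with SB3's helper, reusing the `env` from the training snippet above:

```python
from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy

model = SAC.load("quadruped_sac")
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
print(f"mean reward {mean_reward:.1f} ± {std_reward:.1f}")
```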
Exercise
On the mujoco_playground PandaReach task, train both PPO and SAC. Plot success rate vs environment steps. SAC will hit 90% in ~50k steps; PPO needs ~500k. The 10× sample efficiency is what off-policy buys you. On hardware, that's 10× less robot wear.
Next
Real-world RL — when sample efficiency isn't enough; you need safety, resets, and demonstrations interleaved with online learning. HIL-SERL and the modern recipes.