PPO in practice: the hyperparameters that actually matter
The algorithm that dominates robotics RL. Honest about what's art and what's science. The 8 hyperparameters that determine whether your training succeeds, fails, or wastes a week.
PPO (Proximal Policy Optimization, Schulman et al. 2017) is the default RL algorithm for robotics in 2026. It's robust, well-documented, and works on most problems with reasonable defaults. The catch: "reasonable defaults" hide a dozen knobs that determine success, and most papers don't tell you which one mattered. Here's the practitioner's view.
What PPO actually does
Standard policy gradient methods take big steps. Big steps occasionally land in policy regions that are catastrophically worse — training diverges. PPO clips the policy update so each step is bounded. Specifically:
- Compute advantages A_t for each (state, action) in a batch.
- Compute the ratio r_t = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t).
- Clip ratio to [1−ε, 1+ε]. Take the minimum of the clipped and unclipped surrogate objectives.
The clipping prevents giant policy shifts in either direction. The result: approximately monotonic improvement; the policy rarely gets worse from one update to the next. Most of the time.
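A minimal sketch of that objective in PyTorch (a generic illustration, not any particular library's API; the tensor names are assumptions):

import torch

def ppo_policy_loss(new_logprob, old_logprob, advantages, clip_eps=0.2):
    # r_t = pi_theta(a|s) / pi_theta_old(a|s), computed in log space for numerical stability
    ratio = torch.exp(new_logprob - old_logprob)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # elementwise minimum of the two surrogates, negated because optimizers minimize
    return -torch.min(unclipped, clipped).mean()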
The 8 hyperparameters that matter
1. Learning rate
3e-4 is the default that works for 80% of robotics problems. For smaller networks (e.g. 64-unit MLPs on small tasks), 1e-3 is fine. For large networks or fine-tuning VLAs, 1e-5. Almost never higher than 1e-3.
Linear or cosine decay during training is standard. Constant LR works but slows down late convergence.
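A sketch of linear decay with a stock PyTorch scheduler (the placeholder network and update count are assumptions, not part of any recipe above):

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(48, 256), nn.Tanh(), nn.Linear(256, 12))  # placeholder actor
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
total_updates = 1_000  # however many PPO update phases the run will perform
# scale the base LR by a factor that decays linearly from 1 to 0
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_updates))
# call scheduler.step() once per update phase, after optimizer.step()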
2. Number of parallel environments
Vectorize. The 2024 IsaacGym/MJX/Brax revolution: train with 1000–10,000 parallel environments. PPO benefits enormously from larger batches (lower-variance gradient estimates). Aim for 4096+ environments × 16-step rollouts = 65k transitions per batch.
Wall-clock training time drops as you add environments, though with diminishing returns once the GPU saturates. Use as many as your GPU memory allows.
3. Clip range ε
Default 0.2. Smaller (0.1) → safer but slower. Larger (0.3+) → faster but risk of divergence.
Anneal: start at 0.2, decrease toward 0 over training. Reduces late-training oscillations.
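The anneal itself is just a linear interpolation on the update counter; a sketch:

def clip_range(update, total_updates, start=0.2, end=0.0):
    # linearly interpolate the clip range from `start` toward 0 over training
    frac = min(update / total_updates, 1.0)
    return start + frac * (end - start)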
4. GAE λ (advantage estimator)
0.95 is standard. λ trades bias against variance (a computation sketch follows this list):
- λ = 0: 1-step TD; low variance, high bias.
- λ = 1: Monte Carlo return; high variance, no bias.
- λ = 0.95: empirical sweet spot for most tasks.
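GAE itself is a short backward recursion over the rollout; a minimal sketch, assuming per-step reward, value, and done tensors of shape [T, n_envs]:

import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    gae = torch.zeros_like(last_value)
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - dones[t]                                        # no bootstrap across episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]   # 1-step TD error
        gae = delta + gamma * lam * not_done * gae                       # exponentially weighted sum of deltas
        advantages[t] = gae
    return advantages, advantages + values                               # (advantages, value targets)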
5. Discount γ
0.99 for episodic tasks. 0.999 for very long horizons (continuous control with reward at every step). 0.9 occasionally for short, success/fail tasks.
Common mistake: setting γ too low for a task that requires long-horizon planning. The policy effectively can't see rewards beyond ~1/(1−γ) steps: γ = 0.99 gives a ~100-step horizon, γ = 0.9 only ~10. Pick γ from the task horizon, not by gut feeling.
6. Network size
An MLP with two hidden layers of 64–256 units for joint-state inputs. Two ResNet18 encoders + a small MLP for vision inputs. Larger isn't always better — easy to overfit with too much capacity on small datasets.
Use tanh activations for proprioceptive policies (bounded, smooth activations). ReLU for everything else.
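A sketch of a typical proprioceptive actor-critic in this spirit (observation and action dimensions are placeholders):

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=48, act_dim=12, hidden=256):
        super().__init__()
        # two tanh hidden layers, as discussed above
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim))                     # mean of a Gaussian over actions
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))                           # state value
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # learned, state-independent std

    def forward(self, obs):
        return self.actor(obs), self.log_std.exp(), self.critic(obs).squeeze(-1)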
7. Reward scaling and normalization
The single biggest source of "PPO works on every paper but mine." If reward magnitudes vary wildly (e.g., +100 for success, -1 every step), the value function struggles to fit and the policy gets unstable.
Standard practice:
- Running mean/std normalization on observations.
- Reward scaling: divide rewards by their running standard deviation. Caps any single transition's influence.
- Clip the value function loss to prevent huge updates.
If your training looks unstable, re-check normalization first.
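A minimal running mean/std tracker for observation normalization (the class name is illustrative; the same running statistics, kept over returns, give the reward scaling above):

import numpy as np

class RunningNorm:
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps                       # avoids division by zero before the first update

    def update(self, batch):
        # batch: array of shape [n, *shape]; fold its statistics into the running ones
        b_mean, b_var, b_count = batch.mean(0), batch.var(0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_count
        self.mean = self.mean + delta * b_count / total
        self.var = (self.var * self.count + b_var * b_count
                    + delta ** 2 * self.count * b_count / total) / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)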
8. Entropy bonus
Adds a small bonus for high-entropy policies, encouraging exploration. Coefficient typically 0.0001–0.01. Decay over training.
Too high → policy stays random forever. Too low → policy collapses to a deterministic mode early. Tune by watching the entropy curve during training: should decrease but not go to zero.
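How the bonus enters the total loss, assuming a diagonal Gaussian policy (batch size and action dimension are placeholders):

import torch
from torch.distributions import Normal

mean = torch.zeros(4096, 12)                     # placeholder policy means for one batch
log_std = torch.zeros(12)                        # placeholder log standard deviations
dist = Normal(mean, log_std.exp())
entropy = dist.entropy().sum(-1).mean()          # per-transition entropy, summed over action dims
ent_coef = 0.005
# subtracted from the loss, so higher entropy is rewarded:
# total_loss = policy_loss + value_coef * value_loss - ent_coef * entropy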
The numbers that work for quadruped locomotion
Empirical defaults aggregated across legged-robot RL papers:
learning_rate: 3e-4
n_envs: 4096
n_steps_per_env: 24 # rollout length
n_epochs: 5 # PPO updates per rollout
batch_size: 16384
clip_range: 0.2
gae_lambda: 0.95
gamma: 0.99
ent_coef: 0.005
value_loss_coef: 1.0
max_grad_norm: 1.0
network: [256, 256, 128]
total_steps: 100_000_000 # ~30 minutes on RTX 4090 with MJX
This is approximately what mujoco_playground's quadruped trainer uses. Start here; tune ent_coef and LR if convergence looks bad.
Common training failures
- Loss is NaN: divide-by-zero in advantage normalization, or reward explosion. Add reward clipping; check for division by std=0.
- KL divergence spikes: policy is making huge updates. Reduce learning rate or clip range.
- Policy entropy → 0 instantly: too low entropy bonus or too high LR. Increase ent_coef; decrease LR.
- Value loss stays flat: the value function isn't learning; the gradient is probably saturating or the value targets are badly scaled. Check the value-loss coefficient and value-target normalization.
- Average reward plateaus low: local optimum. Try a different random seed; tweak reward shaping; longer training.
Diagnostics worth logging
- Policy entropy: should decrease smoothly.
- Approx KL divergence: should stay below ~0.02 per update.
- Explained variance of value function: should reach 0.5+ for stable training.
- Episode length: should grow as policy learns to avoid early termination.
- Mean reward, success rate: the actual goal.
Log these to TensorBoard or W&B. When training fails, the answer is usually visible in one of them.
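Two of these are one-liners worth computing inside the update loop (a sketch; tensor names are assumptions):

import torch

def approx_kl(old_logprob, new_logprob):
    # low-variance estimator of KL(pi_old || pi_new) from sampled log-probabilities
    log_ratio = new_logprob - old_logprob
    return ((log_ratio.exp() - 1.0) - log_ratio).mean()

def explained_variance(values, returns):
    # 1.0 means the critic predicts returns perfectly; <= 0 means it is useless
    return 1.0 - torch.var(returns - values) / (torch.var(returns) + 1e-8)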
The seed lottery
Run the same PPO config with 5 different random seeds. Some succeed, some don't. This is real and well-documented. Production training: at least 3 seeds; report median performance. Single-seed claims are noise.
Tools in 2026
- RSL-RL: ETH's PPO implementation, optimized for legged-robot RL with IsaacGym/Lab. The cleanest production-grade reference.
- SB3 (Stable-Baselines3): solid, beginner-friendly, slower than RSL-RL.
- RL Games: PyTorch implementation popular with Isaac Lab users.
- CleanRL: single-file implementations of every algorithm; great for learning.
- Brax: JAX-based; pairs with MJX for ridiculous parallelism.
Exercise
In mujoco_playground, train PPO on the AntStand task with default hyperparameters. Then break it: set LR to 1e-2, watch divergence. Then break it the other way: LR to 1e-6, watch slowness. Then turn off reward normalization; watch instability. This kind of negative experimentation builds intuition that no paper teaches.
Next
SAC and off-policy methods — when sample efficiency matters more than stability.