Reinforcement-learned gaits: the 2024–26 revolution
Five years ago, every quadruped used hand-tuned MPC. Today they all use RL. Here's the recipe — domain randomization, action space choices, real-world deployment — that ate the field.
In 2018, every commercial quadruped used model predictive control with hand-tuned cost functions. By 2024, almost every one used reinforcement learning. The transition was fast because the recipe was clean: train PPO in a randomized simulator, deploy zero-shot. Six years and a few key papers later, RL is the default and classical MPC is the backup.
What changed
Three things converged:
- Massively parallel simulation: Isaac Gym (2021) and MuJoCo MJX (2023) made it possible to run thousands of simulated quadrupeds in parallel on one GPU. Sample efficiency stopped being the bottleneck.
- Domain randomization: randomize friction, mass, motor properties, terrain. The policy generalizes to real hardware without per-robot tuning.
- The Hwangbo paper (2019): ETH's "Learning agile and dynamic motor skills for legged robots" demonstrated zero-shot sim-to-real on ANYmal. Once the recipe worked, everyone copied.
By 2026, hand-tuning a quadruped controller is unusual.
The recipe (production-tested)
1. The simulator
MuJoCo MJX or Isaac Lab. Both run thousands of envs per GPU. Pick whichever your team already uses; both work.
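For scale, here is a minimal sketch of the batching trick in MJX (the MJCF path and the 4,096-environment batch size are placeholders): the model lives on the GPU once, and jax.vmap broadcasts the physics step over a whole batch of states.

```python
import jax
import mujoco
from mujoco import mjx

# Placeholder path -- point this at your own robot MJCF.
model = mujoco.MjModel.from_xml_path("quadruped.xml")
mjx_model = mjx.put_model(model)          # model constants live on the GPU
data = mjx.make_data(mjx_model)           # one template state

def randomize(rng):
    # Perturb each environment's initial joint positions slightly.
    noise = 0.01 * jax.random.normal(rng, data.qpos.shape)
    return data.replace(qpos=data.qpos + noise)

rngs = jax.random.split(jax.random.PRNGKey(0), 4096)
batch = jax.vmap(randomize)(rngs)         # 4096 independent states

# One jit-compiled call steps all 4096 environments at once.
step = jax.jit(jax.vmap(mjx.step, in_axes=(None, 0)))
batch = step(mjx_model, batch)
```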
2. The robot model
URDF or MJCF. Include accurate inertias (measured, not guessed). Add motor models with realistic torque limits, friction, and gear ratios.
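A quick sanity check with the mujoco Python bindings catches the most common modeling mistakes, default inertias and missing torque limits (the file path is a placeholder):

```python
import mujoco

# Load the robot description (placeholder path -- use your own MJCF / converted URDF).
model = mujoco.MjModel.from_xml_path("quadruped.xml")

# Per-link mass and diagonal inertia: these should come from measurements or CAD,
# not from defaults.
for i in range(model.nbody):
    name = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_BODY, i)
    print(f"{name}: mass={model.body_mass[i]:.3f} kg, inertia={model.body_inertia[i]}")

# Actuator gear ratios and torque limits (forcerange).
for i in range(model.nu):
    name = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_ACTUATOR, i)
    print(f"{name}: gear={model.actuator_gear[i, 0]}, "
          f"torque range={model.actuator_forcerange[i]}")
```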
3. The action space
Three options:
- Joint position targets: policy outputs target angles; PD controller tracks. Most common; transfers best.
- Joint velocity / torque: more direct but harder to transfer.
- Higher-level commands: foot placement targets, mapped to joint targets (e.g. via inverse kinematics) and then PD-tracked. Decouples policy learning from low-level control.
Production default in 2026: joint position targets at ~50 Hz, with a 1 kHz PD controller below.
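A minimal sketch of that inner loop, with illustrative gains rather than any particular robot's values:

```python
import numpy as np

# Illustrative gains -- real values depend on the robot's actuators.
KP, KD = 20.0, 0.5          # Nm/rad and Nm/(rad/s)
TORQUE_LIMIT = 33.5         # Nm, example saturation

def pd_torque(q_target, q, qd):
    """Inner-loop PD: turn 50 Hz policy joint targets into 1 kHz torques."""
    tau = KP * (q_target - q) - KD * qd
    return np.clip(tau, -TORQUE_LIMIT, TORQUE_LIMIT)

# The 50 Hz policy target is held constant for the 20 PD ticks that fit
# inside one policy step (1000 Hz / 50 Hz = 20).
```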
4. The observation space
- Joint positions and velocities (proprioception).
- IMU (gravity vector, body angular velocity).
- Last few actions (motor latency awareness).
- Velocity command (forward/sideways/turning rate).
- Optional: terrain height sample around the robot (for rough-terrain policies).
Notably absent: vision. Most quadruped policies are blind. They handle terrain by feel — joint torques and IMU. Vision is added by a separate higher-level planner that picks paths, not joint motion.
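Concretely, the observation is just a concatenation of those pieces. A sketch for a 12-DoF quadruped, with the dimensions and ordering chosen for illustration:

```python
import numpy as np

def build_observation(joint_pos, joint_vel, gravity_body, ang_vel_body,
                      last_actions, command):
    """Assemble the standard proprioceptive observation for a 12-DoF quadruped.

    joint_pos, joint_vel : (12,) joint angles and velocities
    gravity_body         : (3,) gravity direction in the body frame (from IMU)
    ang_vel_body         : (3,) body angular velocity (from IMU)
    last_actions         : (12,) previous policy output, for latency awareness
    command              : (3,) commanded forward, lateral, and yaw velocity
    """
    return np.concatenate([
        joint_pos, joint_vel,        # proprioception
        gravity_body, ang_vel_body,  # IMU
        last_actions,                # action history
        command,                     # velocity command
    ]).astype(np.float32)            # 45-dimensional for this layout
```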
5. The reward
Standard recipe (per ETH's "Learning quadrupedal locomotion over challenging terrain" 2020):
- Tracking: match commanded velocity. Positive.
- Energy: penalize joint torque squared. Negative.
- Smoothness: penalize action rate of change. Negative.
- Safety: penalize joints near their limits, base orientation deviation, and foot slip. Negative.
- Survival: terminate the episode on a fall, with a large penalty.
~10 reward terms with hand-tuned weights. Not pretty; works.
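Sketched in code, the reward is a weighted sum along these lines (the weights and the exponential tracking kernel are illustrative, not the paper's exact values):

```python
import numpy as np

# Illustrative weights -- in practice they are hand-tuned per robot.
W = dict(track_lin=1.0, track_ang=0.5, torque=-1e-4,
         action_rate=-0.01, joint_limits=-1.0, orientation=-1.0, slip=-0.05)

def reward(cmd_vel, base_vel, cmd_yaw_rate, yaw_rate, torques,
           action, prev_action, joint_limit_violation, tilt, foot_slip):
    r = 0.0
    # Tracking: exponential kernel on velocity error (positive terms).
    r += W["track_lin"] * np.exp(-np.sum((cmd_vel - base_vel) ** 2) / 0.25)
    r += W["track_ang"] * np.exp(-(cmd_yaw_rate - yaw_rate) ** 2 / 0.25)
    # Energy: penalize squared joint torques.
    r += W["torque"] * np.sum(torques ** 2)
    # Smoothness: penalize action rate of change.
    r += W["action_rate"] * np.sum((action - prev_action) ** 2)
    # Safety: joint-limit proximity, base tilt, foot slip.
    r += W["joint_limits"] * joint_limit_violation
    r += W["orientation"] * tilt ** 2
    r += W["slip"] * foot_slip
    return r
    # Falls are handled separately: terminate the episode with a large penalty.
```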
6. Domain randomization
For sim-to-real:
- Mass: ±20% per link.
- Friction: 0.3 to 1.5.
- Motor delay: 5–30 ms.
- Motor torque: ±20%.
- Terrain: random heights, slopes, gaps.
- Initial pose: random.
- External pushes: random forces during episodes.
Each parameter is randomized at the start of every episode. The policy learns to be robust to all of them.
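In code, that is one draw per episode from the ranges above. A sketch, with push magnitudes chosen for illustration and terrain/initial pose left to the environment:

```python
import numpy as np

def sample_episode_randomization(rng, nominal_mass, nominal_torque_limit):
    """Draw one set of physics parameters at the start of an episode."""
    return dict(
        link_mass=nominal_mass * rng.uniform(0.8, 1.2, size=nominal_mass.shape),
        friction=rng.uniform(0.3, 1.5),
        motor_delay_s=rng.uniform(0.005, 0.030),
        torque_limit=nominal_torque_limit * rng.uniform(0.8, 1.2),
        push_times_s=rng.uniform(0.0, 10.0, size=2),      # when to push the robot
        push_force_n=rng.uniform(-50.0, 50.0, size=(2, 3)),  # illustrative magnitudes
    )

rng = np.random.default_rng(0)
params = sample_episode_randomization(rng, nominal_mass=np.ones(13) * 1.5,
                                      nominal_torque_limit=33.5)
```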
7. The algorithm
PPO, almost universally. Some teams use SAC or DDPG variants when they want off-policy learning. PPO's stability, plus its ability to soak up massively parallel rollouts, makes it the default.
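For reference, a hyperparameter block in the range commonly used for locomotion training; these are illustrative values, not a recommendation from any specific codebase:

```python
# Illustrative PPO settings for locomotion; every project tunes these to its
# own simulator and robot.
ppo_config = dict(
    learning_rate=3e-4,
    num_envs=4096,            # massively parallel rollouts
    rollout_length=24,        # env steps collected per update
    minibatches=4,
    epochs_per_update=5,
    gamma=0.99,               # discount factor
    gae_lambda=0.95,
    clip_range=0.2,           # PPO clipping
    entropy_coef=0.01,
)
```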
8. The training run
1–10 billion environment steps. ~12–24 hours on one RTX 4090. Convergence is monitored by tracking reward + qualitative gait.
9. Deployment
Export the policy to ONNX or run it directly as a PyTorch module. It runs at 50 Hz on the robot's onboard computer (even a Raspberry Pi handles a small policy). Deployment is zero-shot from sim: no real-world fine-tuning is typically needed.
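A sketch of the export-and-run path, with a stand-in MLP where the trained actor would go (the 45-in / 12-out dimensions follow the observation and action layouts above):

```python
import numpy as np
import torch
import onnxruntime as ort

# Stand-in for the trained actor: a small MLP mapping the 45-d observation
# to 12 joint position targets. Swap in your real checkpoint here.
actor = torch.nn.Sequential(
    torch.nn.Linear(45, 256), torch.nn.ELU(),
    torch.nn.Linear(256, 128), torch.nn.ELU(),
    torch.nn.Linear(128, 12),
)
torch.onnx.export(actor, torch.zeros(1, 45), "policy.onnx")

# Onboard: load once, run once per 50 Hz control tick.
session = ort.InferenceSession("policy.onnx")
input_name = session.get_inputs()[0].name

def policy_step(obs: np.ndarray) -> np.ndarray:
    """One inference: observation in, joint position targets out."""
    out = session.run(None, {input_name: obs.reshape(1, -1).astype(np.float32)})
    return out[0].squeeze()

targets = policy_step(np.zeros(45, dtype=np.float32))  # smoke test
```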
What this gives you
A walking policy that:
- Tracks velocity commands at 0.5–3 m/s.
- Survives unexpected pushes.
- Handles uneven terrain (steps, slopes, debris).
- Recovers if it slips.
- Doesn't fall when carrying unexpected loads.
All without explicit motion planning. The policy *is* the controller and the planner.
What it doesn't give you
- Specific maneuvers: backflips, complicated dance moves. Need separate skill-specific policies or a high-level skill selector.
- Reasoning: "this stair is a bit different" — the policy reacts but doesn't plan ahead.
- Vision-conditioned behaviors: a bare RL policy is blind. Add visual policies (covered in the Learning track) on top.
- Provable stability: pure RL gives no stability guarantees. Hybrid stacks (classical layer + RL refinement) get you both.
The hierarchical pattern
Modern quadruped stacks:
- High-level planner (vision, navigation): picks path goals.
- Velocity command generator: takes path; emits 50 Hz commanded twist.
- RL low-level policy: takes commanded twist + proprioception; emits joint targets.
- PD joint tracking: at 1 kHz on the motor controllers.
Each layer's output is the next layer's input. The RL policy is just one layer.
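A skeleton of one 50 Hz tick makes the layering concrete. Every function here is a hypothetical stand-in; only the rates and the data flow are the point:

```python
import numpy as np

# Layer rates (illustrative but typical).
POLICY_HZ, PD_HZ = 50, 1000
PD_STEPS_PER_POLICY_STEP = PD_HZ // POLICY_HZ   # 20

def velocity_command_from_path(path_error):
    """Hypothetical mid layer: path-following error -> commanded twist."""
    vx = np.clip(1.0 * path_error[0], -1.0, 1.0)       # forward speed
    yaw_rate = np.clip(2.0 * path_error[1], -1.5, 1.5)
    return np.array([vx, 0.0, yaw_rate])

def rl_policy(obs):
    """Stand-in for the trained low-level policy: obs -> 12 joint targets."""
    return np.zeros(12)

def pd_torques(q_target, q, qd, kp=20.0, kd=0.5):
    """Innermost layer: 1 kHz joint tracking."""
    return kp * (q_target - q) - kd * qd

# One 50 Hz tick: each layer's output is the next layer's input.
command = velocity_command_from_path(path_error=np.array([0.5, 0.1]))
obs = np.concatenate([np.zeros(42), command])           # proprioception + command
q_target = rl_policy(obs)
for _ in range(PD_STEPS_PER_POLICY_STEP):
    tau = pd_torques(q_target, q=np.zeros(12), qd=np.zeros(12))
```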
The 2024+ humanoid extension
What worked for quadrupeds is being applied to bipedal humanoids:
- Tesla Optimus, 1X NEO: end-to-end RL policy for walking.
- Boston Dynamics Atlas: RL on top of MPC for some skills (parkour, manipulation).
- Open-source: PHC, HumanoidGym (2024). Run examples; train; deploy.
Bipedal RL is harder than quadrupedal RL: more underactuation, a smaller stability margin, and harder resets after falls. But the same recipe (PPO + domain randomization + a standard reward) is producing real, working policies.
The libraries
- Isaac Lab: NVIDIA's RL framework. Ships ANYmal, Spot, Cassie, G1 examples.
- MuJoCo Playground (Google DeepMind, 2025): clean MJX-based examples; quadrupeds and humanoids.
- Brax: Google's older JAX-based RL framework. Still useful for research.
- Stable-Baselines3: classical RL (PyTorch); pair with your own simulator.
For 2026 production: Isaac Lab if you have NVIDIA hardware, MuJoCo Playground otherwise.
Common gotchas
- Reward hacking: the policy finds an unintended way to maximize reward. Watch the gait visually; tune reward weights.
- Sim-to-real gap on terrain: simulator terrain is too clean. Add noise, debris, slip patches.
- Action smoothness: without action-rate penalty, policies emit chattering actions. Always include a smoothness term.
- Initialization sensitivity: starting from kneeling vs standing changes learned strategies. Randomize.
Exercise
Clone MuJoCo Playground. Run the QuadrupedTrot example. Train PPO for ~12 hours on a GPU. Watch the walking emerge. Then deploy on a Mini Pupper or Spot Micro (~$200–500 hardware). The first time your trained policy walks on a real quadruped, the field's progress is no longer abstract.
Next
Drone control — the same hierarchy applied to a flying body, where underactuation is structural rather than emergent.