RobotForge
Reinforcement learning on real robots: what actually works in 2026

Sim-to-real is finally good enough for amateurs. Here's the minimal stack: MuJoCo MJX, domain randomization, PPO, and a real quadruped on a soft mat.

Five years ago, sim-to-real RL meant six months of domain randomization tuning and a $30K robot. In 2026, weekend hobbyists are deploying learned gaits on $200 quadrupeds. What changed — and the minimal stack you need to follow along.

What changed

  • MuJoCo went open. DeepMind released it under Apache-2.0 in 2021. With MJX, the JAX port, batched simulation runs thousands of environments in parallel on a single GPU.
  • Simple algorithms are enough. PPO on a well-randomized sim works. You don't need SAC, you don't need a fancy world model.
  • Hobby servos are predictable enough. Cheap position servos have high latency (10–30ms) and torque quirks, but they're stable enough to model — and you can train with that latency in sim.
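That latency is straightforward to reproduce in sim. A minimal sketch in plain NumPy (the `DelayedActuator` name and the step counts are my own, not from any library): buffer commanded positions and release them a few control steps late.

```python
from collections import deque

import numpy as np


class DelayedActuator:
    """Apply commanded positions after a fixed latency, like a hobby servo.

    delay_steps = latency / control_dt; e.g. 20 ms at a 100 Hz control
    loop is 2 steps. Values here are illustrative, not measured.
    """

    def __init__(self, n_joints: int, delay_steps: int):
        # Pre-fill so the first few steps return a neutral (zero) command.
        self.buffer = deque([np.zeros(n_joints)] * delay_steps,
                            maxlen=delay_steps + 1)

    def step(self, command: np.ndarray) -> np.ndarray:
        self.buffer.append(np.asarray(command, dtype=float))
        return self.buffer[0]  # oldest command: what the servo acts on now
```

Wrap your sim's actuator targets in this during training and the policy learns to anticipate the lag instead of fighting it.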

The minimal stack

  1. MuJoCo MJX — batched physics in JAX
  2. Brax or Stable-Baselines3 — RL algorithms (PPO)
  3. Domain randomization — randomize mass, friction, joint damping, actuator delay, observation noise during training
  4. Robot — quadruped like Mini Pupper, SpotMicro, or a Poppy-style homebrew
  5. Soft mat — seriously. Your early policies will do very weird things. Save the robot.
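Item 3 is the one that deserves code. A sketch of per-episode parameter sampling; the dict keys are hypothetical and would map onto your simulator's model fields, the friction, mass, and delay ranges match the training loop below, and the damping and noise ranges are my own starting guesses.

```python
import numpy as np


def sample_physics_params(rng: np.random.Generator,
                          nominal_mass: np.ndarray) -> dict:
    """Draw one episode's randomized physics parameters."""
    return {
        "friction": rng.uniform(0.4, 1.2),               # ground contact
        "mass": nominal_mass * rng.uniform(0.8, 1.2,     # +/-20% per link
                                           size=nominal_mass.shape),
        "joint_damping": rng.uniform(0.5, 2.0),          # assumed range
        "actuator_delay_s": rng.uniform(0.005, 0.030),   # 5-30 ms
        "obs_noise_std": rng.uniform(0.0, 0.02),         # rad, encoders
    }
```

Resample at every episode reset so the policy never gets to overfit one plausible robot.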

The training loop

  1. Write a MuJoCo XML of your robot. Measure link lengths with calipers, guess mass from printed-part weights.
  2. Write a reward function. Start embarrassingly simple: stay upright + move forward.
  3. Randomize everything that isn't the geometry. Friction 0.4–1.2. Mass ±20%. Actuator delay 5–30ms.
  4. Train PPO for 10M steps (~30 min on a laptop GPU with MJX).
  5. Export policy weights as ONNX or raw NumPy.
  6. Run on robot via ONNX Runtime or a hand-written inference loop.
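Step 2 in code. A minimal sketch of the stay-upright-plus-move-forward reward; the inputs are assumed to come from your simulator state, and the weights and the 0.10 m standing-height threshold are starting guesses to tune, not a recipe.

```python
import numpy as np


def reward(base_height: float, up_vector: np.ndarray,
           forward_velocity: float) -> float:
    """Embarrassingly simple reward: stay upright + move forward.

    base_height in metres, up_vector = body z-axis in world frame,
    forward_velocity in m/s along the desired heading.
    """
    upright = float(up_vector[2])                # 1.0 when perfectly level
    alive = 1.0 if base_height > 0.10 else 0.0   # assumed standing height
    return alive * (1.0 * upright + 2.0 * max(forward_velocity, 0.0))
```

Resist the urge to add terms until you've seen what this alone produces; most early failures are modeling problems, not reward problems.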

What will go wrong on first deployment

  • Robot will oscillate violently because sim had no observation noise but real sensors do. Fix: add Gaussian noise to joint position readings in sim.
  • Feet will slip. Fix: lower sim friction randomization bounds, or add rubber feet.
  • Robot will lunge forward and self-destruct. Fix: shorter episodes in sim, action smoothing regularizer.

Resources

DeepMind's mujoco_playground is the best starting point in 2026. It ships working examples for quadrupeds, arms, and bipeds. Clone it, swap in your XML, adjust the reward function. That's where I'd start.
