
Sim-to-real: domain randomization playbook

A policy trained in sim that works on real hardware was science fiction in 2018. In 2026 it's a weekend project — if you get the randomization right. Here's the playbook.

by RobotForge
#learning #sim-to-real #rl #domain-randomization

A policy trained for a million simulated lifetimes, deployed on real hardware, working on the first try. This is the dream, and in 2026 it mostly works — for the right problems, with the right randomization. Here's the playbook distilled from five years of hobbyist and academic experience.

The reality gap

Any simulation is a lie in ten interesting ways. A policy trained in sim learns not just how to do the task but also how to exploit exactly the ways this specific sim misrepresents reality. On real hardware, those exploits stop working and the policy fails.

The fix isn't "make the sim better." It's "make the policy not care." Train on many slightly different sims — so many that real hardware looks like just another variant.

That's domain randomization.

What to randomize

Every sim-to-real success story randomizes some subset of these:

Mass and inertia

±20% on link masses is a standard start. It covers inaccurate CAD, wrong material densities, and unmodeled cable mass.
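
As a concrete sketch, here is what per-episode mass randomization can look like with the plain mujoco Python bindings; the model path and the ±20% range are placeholders you would swap for your own.

    import mujoco
    import numpy as np

    model = mujoco.MjModel.from_xml_path("robot.xml")    # placeholder model file
    nominal_mass = model.body_mass.copy()                # cache nominal values once
    nominal_inertia = model.body_inertia.copy()

    def randomize_mass(model, rng, scale=0.2):
        # Scale each link's mass (and its diagonal inertia) by a factor in [1 - scale, 1 + scale].
        factors = rng.uniform(1.0 - scale, 1.0 + scale, size=model.nbody)
        model.body_mass[:] = nominal_mass * factors
        model.body_inertia[:] = nominal_inertia * factors[:, None]

    randomize_mass(model, np.random.default_rng(0))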

Friction

Wide range: 0.4–1.5 for the coefficient of static friction. The robot's feet/wheels/fingers experience vastly different friction on different surfaces.
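
Same pattern for friction: this sketch randomizes the sliding-friction coefficient of every geom over the 0.4–1.5 range mentioned above. In practice you would often restrict it to the foot or fingertip geoms.

    def randomize_friction(model, rng, low=0.4, high=1.5):
        # geom_friction columns are (sliding, torsional, rolling); randomize the sliding term.
        model.geom_friction[:, 0] = rng.uniform(low, high, size=model.ngeom)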

Actuator dynamics

Real motors have: latency (5–30 ms for hobby servos), torque saturation (they can't produce infinite force), backlash (gear slop), and velocity limits. Add all of these to your sim.
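
A minimal sketch of layering latency and torque saturation onto commanded actions; the class name, timestep, and torque numbers are illustrative, not any simulator's API.

    from collections import deque
    import numpy as np

    class ActuatorModel:
        """Adds command latency and torque saturation to a simulated actuator (a sketch)."""

        def __init__(self, rng, n_joints, dt=0.002, nominal_max_torque=3.0):
            delay_s = rng.uniform(0.005, 0.030)                  # 5-30 ms latency
            n_delay = max(1, int(round(delay_s / dt)))
            self.buffer = deque([np.zeros(n_joints)] * n_delay)
            self.max_torque = nominal_max_torque * rng.uniform(0.8, 1.2)

        def __call__(self, commanded_torque):
            self.buffer.append(np.asarray(commanded_torque, dtype=float))
            delayed = self.buffer.popleft()                      # the command issued delay_s ago
            return np.clip(delayed, -self.max_torque, self.max_torque)

Backlash and velocity limits slot into the same place: transform the command before the sim ever sees it.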

Sensor noise

IMU gyro bias and its slow drift (a bias on the order of 0.01 rad/s is typical for hobby-grade MEMS parts), position encoder noise (a few degrees), image-sensor noise, occasional dropouts. Simulate all of them.
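
A sketch of corrupting observations before the policy sees them; the slice indices, noise levels, and dropout probability are all placeholders.

    import numpy as np

    def noisy_observation(obs, rng, gyro_slice, gyro_bias, noise_std=0.03, dropout_p=0.01):
        """Adds gyro bias, white noise, and occasional dropouts to a clean sim observation (a sketch)."""
        obs = np.array(obs, dtype=float)
        obs[gyro_slice] += gyro_bias                        # per-episode bias, e.g. ~0.01 rad/s
        obs += rng.normal(0.0, noise_std, size=obs.shape)   # encoder / IMU white noise
        if rng.random() < dropout_p:
            obs[:] = 0.0                                     # an occasional dropped reading
        return obs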

External perturbations

Push forces, wind for drones, uneven ground. Random pushes with mean zero, peak 10–30% of the robot's weight, applied at random times.
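
A sketch of the push logic, using MuJoCo's xfrc_applied external-force buffer. The push probability and the one-step duration are simplifications; a real setup usually holds the push for a few tens of milliseconds.

    import numpy as np

    def maybe_push(data, rng, base_body_id, robot_weight_n, push_prob=0.01):
        """With small probability each step, applies a random horizontal push to the base (a sketch)."""
        data.xfrc_applied[base_body_id, :3] = 0.0                  # clear the previous push
        if rng.random() < push_prob:
            direction = rng.normal(size=2)
            direction /= np.linalg.norm(direction) + 1e-8
            magnitude = rng.uniform(0.1, 0.3) * robot_weight_n     # 10-30% of body weight
            data.xfrc_applied[base_body_id, :2] = magnitude * direction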

Initial state

Every episode starts in a randomized position/orientation. Teaches the policy to recover from whatever state it finds itself in.
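
In MuJoCo terms, that is a reset along these lines; the noise magnitudes are placeholders, and a floating-base root quaternion needs separate handling before adding offsets.

    import mujoco
    import numpy as np

    def randomized_reset(model, data, rng, pos_noise=0.1, vel_noise=0.5):
        """Resets to the default pose plus random joint offsets and velocities (a sketch)."""
        mujoco.mj_resetData(model, data)
        data.qpos[:] += rng.uniform(-pos_noise, pos_noise, size=model.nq)
        data.qvel[:] += rng.uniform(-vel_noise, vel_noise, size=model.nv)
        mujoco.mj_forward(model, data)   # recompute derived quantities for the first observation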

Geometry (advanced)

If you have the budget, randomize link lengths, joint offsets, and the geometry of the environment. Policies trained this way generalize to robots slightly different from the one they trained on.

How much to randomize

The pivotal question. Too little and you don't cover reality. Too much and the policy becomes conservative or fails to learn at all.

The playbook:

  1. Start with narrow ranges. Train to convergence.
  2. Gradually widen ranges. Retrain.
  3. Deploy on hardware. Observe where it fails.
  4. Identify which randomization the failure implicates and expand that one.

This is called automatic domain randomization (ADR) when done programmatically — OpenAI's Rubik's cube paper pioneered it. Even without automation, this iterative loop is how essentially every sim-to-real success works in practice.
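
In the spirit of ADR (not a reproduction of it), the automated version boils down to a per-parameter rule like the one below, where the thresholds and step size are made-up numbers: evaluate the policy at a range boundary, then move the boundary based on how it did.

    def update_range(low, high, boundary_success_rate, step=0.05,
                     expand_above=0.8, shrink_below=0.3):
        """Widen a randomization range when the policy succeeds at its edge, shrink it when it fails (a sketch)."""
        if boundary_success_rate > expand_above:
            low, high = low - step, high + step
        elif boundary_success_rate < shrink_below:
            low, high = low + step, high - step
        return min(low, high), max(low, high)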

What makes sim-to-real more likely to work

Action space

Position control transfers better than torque control. Torque control is closer to the real physics but exposes more sim quirks. High-level action spaces (like "step in this direction") transfer even better because the low-level dynamics are abstracted.

Observation space

Avoid observations that are fragile in sim. Joint positions: fine. Velocities: noisy on real hardware, so include them only with matching noise added in sim. Images: fragile; they need texture and lighting randomization. Proprioceptive states (your own joint state) transfer more easily than exteroceptive states (cameras looking out at the world).

Policy architecture

Feedforward MLPs work for many quadruped tasks. Recurrent policies (GRU, LSTM) help when the state is partially observed — the network learns to remember enough history to figure out where it is. Transformers work but are overkill for 90% of robot control problems.
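
For reference, here is roughly what that default network looks like as a Flax module; the Gaussian log-std and the hookup to a specific PPO implementation are left out.

    import flax.linen as nn

    class PolicyMLP(nn.Module):
        """3-layer tanh MLP with 256 units per layer, outputting action means (a sketch)."""
        act_dim: int
        hidden: int = 256

        @nn.compact
        def __call__(self, obs):
            x = obs
            for _ in range(3):
                x = nn.tanh(nn.Dense(self.hidden)(x))
            return nn.Dense(self.act_dim)(x)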

Reward shaping

Dense rewards make training faster but can reward sim-specific hacks. Sparse rewards (reach goal / fall down) are slower but more robust. Add small safety penalties (joint limits, high accelerations) to discourage policies that would damage real hardware.
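
As an illustration, a locomotion-flavored reward with a task term and small safety penalties; every weight and argument here is a placeholder to be tuned per robot.

    import numpy as np

    def reward(forward_velocity, joint_pos, joint_limits, joint_accel, fell_over):
        """Task reward minus small penalties that discourage hardware-damaging behavior (a sketch)."""
        task = float(forward_velocity)                       # or a sparse goal-reached bonus
        margin = np.minimum(joint_pos - joint_limits[:, 0],
                            joint_limits[:, 1] - joint_pos)  # distance to the nearest joint limit
        limit_penalty = 0.1 * np.sum(np.clip(0.05 - margin, 0.0, None))
        accel_penalty = 1e-4 * np.sum(np.square(joint_accel))
        fall_penalty = 10.0 if fell_over else 0.0
        return task - limit_penalty - accel_penalty - fall_penalty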

The minimum-viable stack for 2026

  1. MuJoCo MJX or Isaac Lab for batched simulation.
  2. PPO from Brax or stable-baselines3-jax. Not SAC — on-policy PPO is the default pairing with heavy domain randomization and thousands of parallel environments.
  3. A randomization wrapper that samples physical parameters per-episode (see the sketch after this list).
  4. A policy network: a 3-layer MLP with 256 hidden units is enough for most tasks.
  5. 1–10 million environment steps total. On a modern GPU, this is 30 minutes to a few hours with MJX.
  6. A real robot with a mat or catch net underneath. Early deploys will go surprisingly sideways.
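
Item 3 on the list, the per-episode randomization wrapper, is the glue that ties the earlier snippets together. A minimal sketch, where the XML path, ranges, and observation layout are placeholders:

    import mujoco
    import numpy as np

    class RandomizedEnv:
        """Every reset() resamples the physics, so each episode is a slightly different world (a sketch)."""

        def __init__(self, xml_path, seed=0):
            self.model = mujoco.MjModel.from_xml_path(xml_path)
            self.data = mujoco.MjData(self.model)
            self.rng = np.random.default_rng(seed)
            self.nominal_mass = self.model.body_mass.copy()

        def reset(self):
            # Resample physical parameters: mass, friction, and whatever else from the list above.
            self.model.body_mass[:] = self.nominal_mass * self.rng.uniform(0.8, 1.2, self.model.nbody)
            self.model.geom_friction[:, 0] = self.rng.uniform(0.4, 1.5, self.model.ngeom)
            mujoco.mj_resetData(self.model, self.data)
            mujoco.mj_forward(self.model, self.data)
            return np.concatenate([self.data.qpos, self.data.qvel])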

Rough numbers from recent successes

  • MIT mini-cheetah (Kim et al. 2019): 1B+ steps in sim. Deploy worked first try on hardware.
  • ETH ANYmal (Hwangbo et al. 2019): trained ~24 hours on a single GPU, zero-shot transfer.
  • OpenAI Rubik's cube (2019): ADR over hundreds of randomized parameters; >10,000 GPU-days.
  • DeepMind MuJoCo Playground demos (2024): <8 hours of training on one RTX 4090 for many quadruped and manipulation tasks.

The compute cost has dropped 100× in five years. That's why this is now a hobbyist activity.

What will go wrong on your first try

  • Robot oscillates violently. Observation noise in sim was too small. Add noise to what you sense.
  • Policy bakes in the sim's numerical integrator. Use RK4 or smaller timesteps in sim; prefer higher-frequency sensor updates.
  • Feet slip or over-grip. Friction range was too narrow. Expand.
  • Policy commands impossible torques. Actuator limits were missing in sim. Add saturation.
  • Worked on robot A, not robot B. You didn't randomize geometry. Retrain with link-length randomization.

When sim-to-real breaks down entirely

  • Manipulation requiring fine force control. Hobby servos have poor torque fidelity. MuJoCo is better than it used to be but still limited.
  • Deformable objects (cloth, ropes, fluids). Active research — don't expect plug-and-play transfer in 2026.
  • Long-horizon tasks requiring memory. Policies trained in short episodes don't automatically generalize.

For these, hybrid approaches (imitation learning from teleop data + RL fine-tuning) are usually more practical than pure sim-to-real.

A weekend project

Clone mujoco_playground. Pick the QuadrupedTrot environment. Train PPO for 4 hours. Export the policy. Deploy on a Mini Pupper or SpotMicro clone. There's a decent chance it walks on the first try; when it does, remember that this was not possible two years ago.

Once you've watched your policy walk in sim AND on hardware, you'll understand why 2020–26 was the breakout era for legged robotics. Next lesson in this track: real-world RL — HIL-SERL and the techniques for training policies on hardware, no sim required.
