RobotForge

Diffusion policy explained

The 2023 breakthrough that brought generative modeling to robot control. Why predicting actions via denoising beats predicting them directly — and why multimodal action distributions matter.

by RobotForge
#learning #diffusion #imitation

For a decade, behavior cloning predicted "the action" given observations: one number out per joint. Diffusion policy (Chi et al., 2023) flipped this: instead of regressing the action directly, predict the noise that, once removed, recovers it. The shift sounds esoteric. The empirical result: diffusion policies handled multimodal demonstrations (e.g., "go around the obstacle on either side") that direct regression couldn't.

The bug diffusion fixes

Imagine demonstrations of pushing a T-shape to a target: half the demos go around the obstacle on the left, half on the right. Direct behavior cloning, given the same starting state, has to predict one action. The MSE loss averages the two demonstrations into the middle — straight into the obstacle. Robot fails.

This is the multimodality problem. Real demonstrations are almost always multimodal. Direct regression collapses them.
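
The failure is easy to see numerically: the MSE-optimal point prediction is the conditional mean of the demonstrations, which sits between the modes. A tiny illustration (the ±1 actions are hypothetical stand-ins for "steer left" / "steer right"):

import torch

# Two equally common demo actions for the same state: steer left or steer right.
demo_actions = torch.tensor([-1.0, +1.0])

# The MSE-optimal point prediction is the mean of the targets...
prediction = demo_actions.mean()
print(prediction)  # tensor(0.) -- neither mode; straight into the obstacle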

The diffusion idea

Borrowed from image generation. Instead of predicting clean actions directly:

  1. Take training data (clean actions a^*).
  2. Add Gaussian noise at varying levels k=1..K to produce noisy actions a_k.
  3. Train a network to predict the noise that was added: \epsilon_\theta(a_k, o, k) \approx \epsilon.
  4. At inference: start from pure noise a_K \sim \mathcal{N}(0, I) and iteratively denoise: a_{k-1} = a_k - \alpha_k \epsilon_\theta(a_k, o, k). (Both steps are sketched in code below.)
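
A minimal PyTorch sketch of steps 3 and 4, assuming a hypothetical denoiser(noisy_actions, obs_emb, k) network. The linear beta schedule is a standard DDPM choice, and the sampler mirrors the simplified update in step 4 rather than the full DDPM posterior:

import torch
import torch.nn.functional as F

K = 100                                            # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, K)              # DDPM-style noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, clean_actions, obs_emb):
    """Step 3: train denoiser(a_k, o, k) to predict the injected noise."""
    B = clean_actions.shape[0]
    k = torch.randint(0, K, (B,))                           # random noise level per sample
    eps = torch.randn_like(clean_actions)                   # the noise the net must recover
    ac = alphas_cumprod[k].view(B, *([1] * (clean_actions.dim() - 1)))
    noisy = ac.sqrt() * clean_actions + (1 - ac).sqrt() * eps
    return F.mse_loss(denoiser(noisy, obs_emb, k), eps)

@torch.no_grad()
def sample(denoiser, obs_emb, shape, alpha=0.1):
    """Step 4: start from pure noise, iteratively denoise (simplified update)."""
    a = torch.randn(shape)                                  # a_K ~ N(0, I)
    for k in reversed(range(K)):
        k_batch = torch.full((shape[0],), k, dtype=torch.long)
        a = a - alpha * denoiser(a, obs_emb, k_batch)       # a_{k-1} = a_k - alpha_k * eps_theta
    return a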

Why does this beat regression? The denoising model implicitly represents the full distribution of clean actions consistent with the observation. Sampling from it produces one mode at a time — left-around or right-around the obstacle, not the average.

The architecture

The original Diffusion Policy uses a 1D convolutional UNet over the action sequence. Inputs: an observation embedding, the noisy action sequence, and the diffusion step. Output: predicted noise of the same shape as the input action sequence.
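
To make that I/O contract concrete, here is a toy stand-in with the same interface (a hedged sketch, not the original architecture; the dimensions and layer choices are illustrative):

import torch
import torch.nn as nn

class TinyActionDenoiser(nn.Module):
    """Toy 1D-conv stand-in for the UNet: conditions on an observation embedding
    and the diffusion step, and returns noise shaped like the action sequence."""

    def __init__(self, action_dim=7, obs_dim=64, hidden=128, k_max=100):
        super().__init__()
        self.k_embed = nn.Embedding(k_max, hidden)       # diffusion-step embedding
        self.obs_proj = nn.Linear(obs_dim, hidden)       # observation conditioning
        self.inp = nn.Conv1d(action_dim, hidden, 3, padding=1)
        self.mid = nn.Conv1d(hidden, hidden, 3, padding=1)
        self.out = nn.Conv1d(hidden, action_dim, 3, padding=1)

    def forward(self, noisy_actions, obs_emb, k):
        # noisy_actions: (B, T, action_dim); convolve over the time axis T.
        x = self.inp(noisy_actions.transpose(1, 2))
        cond = (self.k_embed(k) + self.obs_proj(obs_emb)).unsqueeze(-1)  # (B, hidden, 1)
        x = torch.relu(self.mid(torch.relu(x + cond)))
        return self.out(x).transpose(1, 2)               # (B, T, action_dim): predicted noise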

Variants in 2024–26:

  • DiT (Diffusion Transformer): replaces UNet with a transformer. Cleaner; common in modern VLAs.
  • Flow matching: replaces stochastic denoising with a deterministic flow. Faster inference, fewer steps. π0 uses this; a sketch of the objective follows this list.
  • Consistency models: train to denoise in a single step. Latency-friendly for real-time control.
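
Since π0-class models lean on flow matching, it deserves a concrete anchor. A sketch of the rectified-flow variant of the objective, assuming a hypothetical velocity network v_net(x_t, obs_emb, t):

import torch
import torch.nn.functional as F

def flow_matching_loss(v_net, clean_actions, obs_emb):
    """Regress the constant velocity of the straight noise-to-data path."""
    B = clean_actions.shape[0]
    t = torch.rand(B, *([1] * (clean_actions.dim() - 1)))  # random time in [0, 1)
    noise = torch.randn_like(clean_actions)
    x_t = (1 - t) * noise + t * clean_actions              # point on the straight path
    target = clean_actions - noise                         # that path's velocity
    return F.mse_loss(v_net(x_t, obs_emb, t), target)

At inference you integrate da/dt = v_net(a, o, t) from t = 0 to t = 1 with a handful of Euler steps, which is where the step-count savings over stochastic denoising come from.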

The action chunking detail

Diffusion policy predicts a sequence of actions (typically 16 future timesteps), not a single one. Two reasons:

  • Smoothness: the network sees the broader action structure, predicts coherent trajectories.
  • Inference cost: denoising is slow (10–100 steps); amortize over many timesteps of execution.

You execute, say, 8 of the 16 predicted actions before re-running diffusion. The control rate itself is unchanged; it's the re-planning rate that drops to one-eighth of it, so ALOHA-style bimanual robots running this pattern at ~25 Hz only invoke the denoiser at roughly 3 Hz.
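
The execution pattern, as a library-agnostic sketch (predict_chunk, sensor, robot, and task_done are placeholders):

HORIZON, EXECUTE = 16, 8

while not task_done():
    obs = sensor.read()
    chunk = predict_chunk(obs)         # (HORIZON, action_dim), via iterative denoising
    for action in chunk[:EXECUTE]:     # commit only the first half of the horizon
        robot.execute(action)          # denoising cost amortized over EXECUTE steps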

What diffusion policy gets right

  • Multimodal demonstrations: the canonical advantage. Multiple valid behaviors don't average into impossibility.
  • Long-horizon coherence: predicting chunks instead of single actions preserves task structure.
  • Robustness to small data: diffusion's implicit regularization helps with 200-demo datasets where regression overfits.
  • Conditional generation: easy to condition on goals, language, etc. Flexible.

What diffusion policy still struggles with

  • Inference speed: 10–50 denoising steps per chunk. Hard real-time control loops are out of reach without optimization (see the flow-matching and consistency-model variants above).
  • Out-of-distribution: still imitation learning. Hasn't seen → can't do.
  • Force / contact: kinematic action space; tactile-rich tasks need extra modeling.

The minimum implementation

Install LeRobot. Use its Diffusion Policy implementation:

from lerobot.policies.diffusion import DiffusionPolicy

# Load pretrained weights (or train from scratch on a LeRobotDataset).
policy = DiffusionPolicy.from_pretrained('lerobot/diffusion_pusht')
policy.reset()  # clear the internal action queue between episodes

# select_action() returns one action per call; under the hood, the pretrained
# PushT config predicts a 16-step chunk and replays 8 actions before denoising
# again. sensor.read() / robot.execute() are placeholders for your own I/O,
# and obs must be a dict of batched tensors matching the policy's inputs.
while not done:
    obs = sensor.read()
    action = policy.select_action(obs)
    robot.execute(action)

That's the practical use: a dozen lines for a working policy.

Comparison with ACT

The Action Chunking Transformer (ACT, next lesson) is the other dominant 2023 architecture. Both predict action chunks; ACT uses a transformer trained as a conditional VAE, while Diffusion Policy uses denoising. Empirically:

  • Diffusion handles multimodal demos better.
  • ACT is faster at inference (single forward pass).
  • For most production tasks they perform within a few percentage points.

Modern VLAs (π0, OpenVLA-flow) increasingly use flow matching (a diffusion descendant) for action decoding.

Where diffusion shows up in 2026

  • π0 / π0.5: flow matching as the action decoder.
  • 3D Diffuser Actor: diffusion in SE(3) for end-effector poses.
  • Diffusion-VLA hybrids: pretrained on web data, fine-tuned with diffusion for actions.
  • RDT (Robotics Diffusion Transformer): diffusion-based VLA with strong public benchmarks.

Exercise

Run the original Diffusion Policy on the PushT task in LeRobot. Train for an hour on a single GPU; deploy in sim. Compare against an MLP behavior-cloning baseline trained on the same data. The success-rate difference (often 30 percentage points) is what diffusion buys you.
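
For the deploy-in-sim half, a sketch adapted from LeRobot's pretrained-policy evaluation example; observation key names and import paths have shifted across LeRobot releases, so verify them against your installed version:

import gymnasium as gym
import gym_pusht  # noqa: F401  (registers the PushT environments)
import torch
from lerobot.policies.diffusion import DiffusionPolicy

env = gym.make("gym_pusht/PushT-v0", obs_type="pixels_agent_pos", max_episode_steps=300)
policy = DiffusionPolicy.from_pretrained('lerobot/diffusion_pusht')
policy.reset()

obs, info = env.reset(seed=42)
done = False
while not done:
    # Pack the observation the way the policy expects: batched float tensors.
    state = torch.from_numpy(obs["agent_pos"]).float().unsqueeze(0)
    image = torch.from_numpy(obs["pixels"]).float().permute(2, 0, 1).unsqueeze(0) / 255
    with torch.inference_mode():
        action = policy.select_action({"observation.state": state, "observation.image": image})
    obs, reward, terminated, truncated, info = env.step(action.squeeze(0).numpy())
    done = terminated or truncated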

Next

ACT and action chunking — the other dominant architecture from the 2023 imitation-learning revolution.
