
Imitation learning 101: behavior cloning and DAgger

The simplest way to train a robot policy: show it what to do, then make it copy. Here's how behavior cloning works, why it fails on its own, and the fix (DAgger) that made modern VLAs possible.

by RobotForge
#learning #imitation-learning #behavior-cloning

Before reinforcement learning at scale, before VLAs, there was imitation learning: show the robot what to do, then train it to copy you. It's the simplest recipe that actually works — and it's the training signal behind every modern VLA, including π0 and Gemini Robotics. Here's the mental model.

Behavior cloning, in one page

You record the expert (a human teleoperator) controlling the robot. Each frame gives you a state s and the action a the expert chose.

That's a dataset: D = {(s_1, a_1), (s_2, a_2), ..., (s_N, a_N)}.

Train a neural net π(s) to predict a from s. Loss is supervised — MSE for continuous actions, cross-entropy for discrete. That's it. That's behavior cloning.

import torch
import torch.nn as nn

# state_dim, action_dim, n_epochs, and loader come from your setup:
# loader yields (state, action) batches drawn from the recorded demos.
policy = nn.Sequential(
    nn.Linear(state_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, action_dim),
)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

for epoch in range(n_epochs):
    for s, a in loader:
        pred = policy(s)                        # predicted action
        loss = ((pred - a) ** 2).mean()         # MSE against the expert action
        opt.zero_grad(); loss.backward(); opt.step()

That's a working behavior-cloning policy. Swap the MLP for a CNN if your state is images. Swap for a transformer if you have long context. The training loop is the same.

Why behavior cloning usually fails

The classic result: train a BC policy on 1000 demonstrations of driving down a highway lane. Deploy. It drifts slightly off-center, because the policy is imperfect. Now it's in a state it's never seen (slightly off-center), so its next action is worse. It drifts further. Within 10 seconds the car is in a ditch.

This is covariate shift. The training distribution is "states the expert visited." The test distribution is "states the student visits." Once the student makes one mistake, it's off the training manifold and every subsequent prediction degrades. Small errors compound.

Formally: if the policy errs with probability ε at each step, the best you can hope for over a T-step episode is roughly Tε total expected error. Under compounding covariate shift, behavior cloning's error can instead grow on the order of T²ε (Ross & Bagnell, 2010). That quadratic is the killer.
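You can watch the compounding happen in a toy simulation. A minimal sketch, assuming a made-up 1D lane-keeping task: inside the band of states the demos covered, the policy steers back to center with an occasional small slip; outside that band it has never seen data and acts essentially at random.

import numpy as np

rng = np.random.default_rng(0)

def off_lane_steps(T=200, eps=0.02, band=1.0):
    # Expert-like corrections inside the training band, garbage outside it.
    # Returns how many of the T steps were spent off the lane.
    x, off = 0.0, 0
    for _ in range(T):
        if abs(x) <= band:
            a = -0.5 * x                   # looks like the expert
            if rng.random() < eps:
                a += rng.normal(0.0, 1.0)  # rare small mistake
        else:
            a = rng.normal(0.0, 1.0)       # off-distribution: no idea what to do
        x += a
        off += abs(x) > band
    return off

print(np.mean([off_lane_steps() for _ in range(200)]))

Even with ε = 0.02, a single slip can strand the policy off-distribution for many steps before it stumbles back into the band — exactly the compounding the T² term describes.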

DAgger: the fix

DAgger (Dataset Aggregation, Ross et al. 2011) solves covariate shift directly: collect more demonstrations in states the student actually visits.

The loop:

  1. Train π on the initial demos.
  2. Roll out π on the robot. Record the states it visits.
  3. Ask the expert what they would have done in each of those states. (This is the annoying part.)
  4. Add the new (state, expert action) pairs to the dataset.
  5. Retrain. Repeat.

After a few iterations, the training distribution covers the states the student actually visits. Errors get corrected instead of compounding, and covariate shift stops being the bottleneck.
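Here's the loop as code. A minimal sketch: train_bc, rollout, and expert_label are hypothetical callables standing in for the supervised training loop above, a rollout on the robot, and the expert's relabeling.

def dagger(train_bc, rollout, expert_label, init_states, init_actions, n_iters=5):
    # train_bc(states, actions) -> policy     (the supervised loop from above)
    # rollout(policy)           -> list of states the *student* visits
    # expert_label(state)       -> what the expert would have done there
    states, actions = list(init_states), list(init_actions)
    for _ in range(n_iters):
        policy = train_bc(states, actions)               # 1. (re)train on everything so far
        visited = rollout(policy)                        # 2. roll out the student
        labels = [expert_label(s) for s in visited]      # 3. query the expert (the annoying part)
        states, actions = states + visited, actions + labels  # 4. aggregate, then repeat
    return policy

The whole algorithm is dataset bookkeeping; the training step is unchanged from plain behavior cloning.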

The catch: you need the expert to keep labeling, including in states they'd never naturally visit. Fast online access to an expert is expensive.

Modern variants

  • EnsembleDAgger: only query the expert when an ensemble of policies disagrees on the action, cutting expert-label cost.
  • HG-DAgger: in teleop mode, the expert grabs the controls mid-episode whenever the policy goes off the rails. Those interventions are easier to collect than per-state labels.
  • Action chunking (ACT): instead of predicting one action at a time, predict the next N actions (see the sketch after this list). This averages over noise and dramatically cuts compounding error. Used in ALOHA and many VLAs.
  • Diffusion policy: model the action distribution with a denoising diffusion model. Multimodal actions become possible (the robot can go either left or right without averaging the two into straight ahead).
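The chunking idea itself is a small change to the earlier policy head: predict the next N actions at once and train against the next N expert actions. A sketch reusing state_dim, action_dim, and nn from the snippet above; the real ACT wraps this in a transformer with a CVAE.

chunk = 8  # predict the next 8 actions per forward pass

chunked_policy = nn.Sequential(
    nn.Linear(state_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, chunk * action_dim),
)

def predict_chunk(s):
    # s: (batch, state_dim) -> (batch, chunk, action_dim)
    return chunked_policy(s).view(-1, chunk, action_dim)

# Training target: the next `chunk` expert actions. The loss is still MSE.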

Where behavior cloning is enough

  • Tasks where the expert always reaches the same goal the same way (low variance).
  • Tasks with short horizons (errors have little time to compound).
  • Tasks where you can collect huge datasets (scale shrinks ε).

Modern VLAs fit the third case: train on hundreds of thousands of trajectories and cover the state space so densely that covariate shift becomes small. They still use imitation learning; they just scale it until its classic failure mode stops mattering.

When to reach for RL instead

  • Tasks where demonstrations are hard to provide but the reward is easy to evaluate (e.g., "don't fall down").
  • Tasks where superhuman performance matters — BC is capped by the expert.
  • Tasks in simulation where you can collect effectively unlimited experience.

In practice, most modern robot-learning systems are hybrid: pretrain with imitation learning, fine-tune with RL. That's roughly the recipe behind much recent humanoid work, where policies imitate motion-capture data and are then refined with RL in simulation.

A weekend project

Record 50 demonstrations of yourself controlling a simulated arm to pick up a block. Train a BC policy. Roll it out 20 times. Measure success rate. Then try DAgger for 3 iterations — you intervene when the policy fails. Measure the new success rate. This experiment takes a weekend with LeRobot or MuJoCo and teaches you more about BC than a month of papers.
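For the measurement step, here's a sketch of the rollout loop, assuming a Gymnasium-style environment whose info dict exposes a "success" flag (your simulator's API and success criterion will differ), and reusing torch from the earlier snippet.

def success_rate(policy, env, n_episodes=20, max_steps=300):
    wins = 0
    for _ in range(n_episodes):
        obs, _ = env.reset()
        for _ in range(max_steps):
            with torch.no_grad():
                action = policy(torch.as_tensor(obs, dtype=torch.float32))
            obs, _, terminated, truncated, info = env.step(action.numpy())
            if info.get("success", False):   # assumed flag; check your env
                wins += 1
                break
            if terminated or truncated:
                break
    return wins / n_episodes

Run it before and after each DAgger iteration so you can see the success rate move.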

What this connects to

Next: diffusion policy — the 2023 architecture that changed the game on multimodal demonstrations. Then ACT and action chunking. Then, finally, what all of these look like inside modern VLAs.
