RobotForge
Published·~12 min

ACT and action chunking

The architecture that made bimanual teleop policies actually work. Predict 16 actions at once, execute open-loop, repeat. Behind ALOHA, Mobile ALOHA, and many of the 2024+ VLAs.

by RobotForge
#learning #act #imitation

Action Chunking Transformer (Zhao et al., 2023) is the imitation-learning architecture behind ALOHA, Mobile ALOHA, and many of the 2024–26 VLAs. The key idea is in the name: predict chunks of future actions, not single ones. The change is small, but the empirical impact is huge.

The compounding-error problem (revisited)

Behavior cloning with single-step actions has an inherent problem: each prediction carries a small error, errors push the robot off-distribution, and off-distribution predictions are worse still. Total error compounds quadratically in the horizon. The Imitation Learning 101 lesson covers DAgger as one fix; ACT brings a different one.
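The effect is easy to see in a toy model. The sketch below is my own illustration, not from the paper: it injects a fixed per-decision error and a per-step off-distribution drift, so committing to a chunk of H = 16 actions means 16x fewer noisy decisions over the same horizon.

```python
def final_error(T=200, H=1, eps=0.01, drift=0.05):
    """Toy compounding-error model: every H steps the policy makes a fresh
    (slightly wrong) decision; between decisions, the accumulated error
    grows by `drift` per step as the robot moves off-distribution."""
    err = 0.0
    for t in range(T):
        if t % H == 0:          # a new prediction is made
            err += eps          # per-decision prediction error
        err *= 1 + drift        # off-distribution drift, every step
    return err

print(final_error(H=1) / final_error(H=16))  # roughly 10x less error with chunks
```

The constants are arbitrary; the point is only that fewer decision points means fewer opportunities for error to enter the loop.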

ACT's solution: predict the next H actions all at once

Given the current observation, predict actions a_t, a_{t+1}, ..., a_{t+H-1} simultaneously. Execute the first one. Re-predict at the next step (or, more commonly, execute several before re-predicting).

This decouples decision frequency from execution frequency. The network sees the broader trajectory structure, produces smoother actions, and amortizes inference cost over many timesteps.

Three things ACT adds beyond chunking

1. Conditional VAE

The training data is stochastic: different demonstrators do the same task differently. ACT models this by adding a latent variable z that captures the demonstration "style" and predicting actions conditioned on z. At inference, z is set to the mean of the prior, i.e. zero.

Result: ACT can reproduce the diversity in demonstrations rather than averaging across them.

2. Transformer architecture

Encoder: ResNet on each camera, features concatenated and projected. Joint state appended.

Decoder: a small transformer that attends to the encoder features and outputs the whole action sequence. Roughly 80M parameters in the original paper; trains in a few hours on a single GPU.

3. Temporal ensembling at inference

At each of the last H timesteps, the network predicted a chunk that includes an action for the current time t. The current command can therefore be a weighted average of all the predictions for t made over those steps. This smooths over prediction noise dramatically.

The weighting is exponential: w_i = exp(-m·i), with w_0 the weight on the oldest prediction, so older predictions carry more weight and m sets how quickly new observations are incorporated. Empirically this roughly halves jitter compared to naive open-loop chunk execution.
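The ensembling step is a few lines. A sketch following the paper's weighting scheme w_i = exp(-m·i) with w_0 on the oldest prediction; the function name and array layout are my own:

```python
import numpy as np

def temporal_ensemble(predictions, m=0.1):
    """Combine every stored chunk prediction that covers the current timestep.

    `predictions`: list of per-timestep actions, ordered oldest -> newest,
    each of shape (action_dim,). Weights are w_i = exp(-m * i) with w_0 on
    the OLDEST prediction; a smaller m weights predictions more uniformly,
    incorporating new observations faster.
    """
    preds = np.asarray(predictions, dtype=float)
    w = np.exp(-m * np.arange(len(preds)))
    return (w[:, None] * preds).sum(axis=0) / w.sum()
```

In a running controller you would keep a buffer of the last H chunks, slice out each one's prediction for the current timestep, and feed that list to this function every tick.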

Why ACT works for ALOHA-style tasks

ALOHA is bimanual: two arms with grippers, often with delicate coordination required (handing items, opening containers). The chunked prediction lets the network plan multi-step coordination ("right hand pick up bottle, left hand hold cap, right hand twist") that single-step cloning would miss.

Combined with ALOHA's high-quality teleoperation hardware (puppet-style master/follower), ACT achieved 90%+ success on tasks that previously needed RL or hand-engineered policies.

The implementation

class ACT(nn.Module):
    def __init__(self, n_cams, action_dim, ...):
        super().__init__()
        # Vision encoders (one per camera)
        self.cam_encoders = nn.ModuleList([resnet18() for _ in range(n_cams)])
        # CVAE encoder: infers the style latent from the demo's action sequence
        self.style_encoder = TransformerEncoder(...)
        # Action decoder: cross-attends to image/state/latent features
        self.action_decoder = TransformerDecoder(num_layers=4, d_model=512)
        self.action_head = nn.Linear(512, action_dim)

    def forward(self, images, joint_state, action_targets=None):
        cam_feats = torch.cat([enc(im) for enc, im in zip(self.cam_encoders, images)], dim=1)
        # Style latent z: posterior at training time, prior mean at inference
        if action_targets is not None:  # training
            mu, logvar = self.style_encoder(action_targets, joint_state)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        else:
            z = torch.zeros(...)  # ACT fixes z = 0 (the prior mean) at test time
        # Decode the full action chunk from learned positional queries
        memory = torch.cat([cam_feats, joint_state, z], dim=1)
        chunk = self.action_decoder(query=positional_embeddings, memory=memory)
        return self.action_head(chunk)  # shape (H, action_dim)

~150 lines of PyTorch in total. Train on a LeRobotDataset with the standard imitation-learning loss + KL term on the latent z.
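The loss itself also fits in a few lines. A sketch, assuming the style encoder returns (mu, logvar) for the posterior and using the KL weight of 10 reported in the paper; the function name is mine:

```python
import torch
import torch.nn.functional as F

def act_loss(pred_chunk, target_chunk, mu, logvar, kl_weight=10.0):
    """ACT training objective: L1 reconstruction over the action chunk plus
    a KL term pulling the style posterior N(mu, sigma^2) toward N(0, I)."""
    recon = F.l1_loss(pred_chunk, target_chunk)
    # Standard Gaussian KL, averaged over latent dimensions
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl
```

The L1 (rather than L2) reconstruction loss is the paper's choice; it is less sensitive to the occasional jerky demonstration frame.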

Empirical numbers

  • Original ALOHA (2023): 90% on threading a zip tie, 97% on opening a cup.
  • Mobile ALOHA (2024): similar success on tasks requiring base motion.
  • OpenVLA (2024): trained on Open X-Embodiment with autoregressive action tokens; its OFT variant later added ACT-style chunked action decoding.
  • π0 (replaces ACT with flow matching): incremental improvement in smoothness.

ACT vs Diffusion Policy: when to pick which

                           ACT                              Diffusion Policy
Inference speed            Single forward pass (~10 ms)     10–50 denoising steps (~50 ms)
Multimodal demos           Good (CVAE)                      Better (full distribution)
Implementation simplicity  Standard transformer             UNet + noise scheduler
Best for                   Real-time control,               Highly multimodal demos,
                           modest data variance             rich data

For most teams: start with ACT. Switch to diffusion if you observe action averaging on multimodal data.

Production gotchas

  • Chunk length: too short = same compounding-error problem; too long = stale chunks. 16 timesteps at 50 Hz (0.32 s) is the standard.
  • Camera count: more cameras → more compute, often better policies. Two (scene + wrist) is the practical minimum.
  • Joint vs end-effector action: end-effector is more general; joint is sometimes more precise. Choose based on task.
  • Action normalization: per-channel z-score is standard. Forget this and the loss is dominated by big-magnitude joints.
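The normalization gotcha in particular is cheap to get right. A minimal per-channel z-score sketch (helper names are mine, not from any library):

```python
import numpy as np

def fit_normalizer(actions):
    """Per-channel z-score statistics over the whole dataset.
    `actions`: (N, action_dim) array of demonstration actions."""
    mean = actions.mean(axis=0)
    std = actions.std(axis=0) + 1e-8   # avoid divide-by-zero on frozen joints
    return mean, std

def normalize(a, mean, std):
    return (a - mean) / std

def denormalize(a, mean, std):
    return a * std + mean
```

Fit the statistics once on the training set, normalize targets before computing the loss, and denormalize the network's output before sending commands to the robot.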

Exercise

In LeRobot, train an ACT policy on the SO-100 PushT example. Compare success rate with chunk size 8, 16, 32. The 16-step default is usually best; longer chunks plateau and shorter ones revert to the compounding-error regime.

Next

RL primer for roboticists — the other half of robot learning, distilled to what a manipulation engineer needs.
