ACT and action chunking
The architecture that made bimanual teleop policies actually work. Predict 16 actions at once, execute open-loop, repeat. Behind ALOHA, Mobile ALOHA, and many of the 2024+ VLAs.
Action Chunking with Transformers (Zhao et al., 2023) is the imitation-learning architecture behind ALOHA, Mobile ALOHA, and many of the 2024–26 VLAs. The key idea is in the name: predict chunks of future actions, not single ones. The change is small, but the empirical impact is huge.
The compounding-error problem (revisited)
Behavior cloning with single-step actions has an inherent failure mode: each prediction carries a small error, errors push the robot off-distribution, and off-distribution predictions are worse still. The classic analysis (Ross & Bagnell, 2010) shows expected cost degrading quadratically over the horizon: O(εT²) for T steps with per-step error ε. The Imitation Learning 101 lesson covers DAgger as one fix; ACT brings a different one.
ACT's solution: predict the next H actions all at once
Given the current observation, predict actions a_t, a_{t+1}, ..., a_{t+H-1} simultaneously. Execute the first one. Re-predict at the next step (or, more commonly, execute several before re-predicting).
This decouples decision frequency from execution frequency: with chunk length H, the number of closed-loop decisions drops from T to T/H, directly shrinking the compounding-error term. The network also sees broader trajectory structure, produces smoother actions, and amortizes inference cost over many timesteps.
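A sketch of the resulting control loop, assuming a hypothetical `policy.predict(obs)` that returns an (H, action_dim) chunk and an `env` whose `step` returns `(obs, done)`:

```python
H, K = 16, 8  # chunk length; steps executed before re-predicting

def run_episode(policy, env, max_steps=500):
    """Receding-horizon execution: predict H actions, execute K, repeat."""
    obs = env.reset()
    for _ in range(max_steps // K):
        chunk = policy.predict(obs)       # shape (H, action_dim)
        for action in chunk[:K]:          # open-loop within the chunk
            obs, done = env.step(action)  # hypothetical 2-tuple return
            if done:
                return obs
    return obs
```

K = 1 gives fully closed-loop control at full inference cost; K = H is pure open-loop chunking; values in between trade reactivity against compute.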
Three things ACT adds beyond chunking
1. Conditional VAE
The training data is stochastic: different demonstrators do the same task differently. ACT models this with a latent variable z that captures the demonstration "style" and predicts actions conditioned on z. At inference, z comes from the unit-Gaussian prior (the paper simply uses its mean, z = 0).
Result: ACT can reproduce the diversity in demonstrations rather than averaging across them.
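A minimal sketch of the matching training objective. β = 10 is the KL weight reported in the ACT paper; `mu` and `logvar` are the style encoder's outputs:

```python
import torch
import torch.nn.functional as F

def cvae_loss(pred_chunk, target_chunk, mu, logvar, beta=10.0):
    """L1 reconstruction + beta * KL(q(z | demo) || N(0, I))."""
    recon = F.l1_loss(pred_chunk, target_chunk)
    # KL of a diagonal Gaussian to the unit Gaussian, averaged over the
    # batch (the paper sums over latent dimensions instead)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```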
2. Transformer architecture
Encoder: ResNet on each camera, features concatenated and projected. Joint state appended.
Decoder: small transformer that attends to the encoder features and outputs the whole action sequence (see the implementation below). Roughly 80M parameters; trains in a few hours on a single GPU.
3. Temporal ensembling at inference
At each timestep the policy outputs a chunk covering t, t+1, ..., t+H-1, so the overlapping chunks from the last H steps each contain a prediction for the current time t. The command actually sent can be a weighted average of all of those predictions, which smooths over prediction noise dramatically.
The weighting is exponential, w_i = exp(-m·i) with w_0 on the oldest prediction, so earlier plans dominate and newer observations blend in gradually. Empirically this roughly halves jitter compared to naive open-loop chunk execution.
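A sketch of that ensembling step, following the paper's scheme (m = 0.01 is its default; `preds_for_t` collects every past chunk's prediction for the current timestep, oldest first):

```python
import numpy as np

def temporal_ensemble(preds_for_t, m=0.01):
    """Weighted average of all predictions made for the current timestep."""
    preds = np.asarray(preds_for_t)               # (n, action_dim), oldest first
    weights = np.exp(-m * np.arange(len(preds)))  # w_0 -> oldest prediction
    weights /= weights.sum()
    return weights @ preds                        # convex combination of actions
```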
Why ACT works for ALOHA-style tasks
ALOHA is bimanual: two arms with grippers, often with delicate coordination required (handing items, opening containers). The chunked prediction lets the network plan multi-step coordination ("right hand pick up bottle, left hand hold cap, right hand twist") that single-step cloning would miss.
Combined with ALOHA's high-quality teleoperation hardware (puppet-style master/follower), ACT achieved 90%+ success on tasks that previously needed RL or hand-engineered policies.
The implementation
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ACT(nn.Module):
    def __init__(self, n_cams=2, state_dim=14, action_dim=14,
                 chunk_size=16, d_model=512, z_dim=32):
        super().__init__()
        # Vision encoders (one per camera); the final fc projects to d_model
        self.cam_encoders = nn.ModuleList(
            [resnet18(num_classes=d_model) for _ in range(n_cams)])
        self.state_proj = nn.Linear(state_dim, d_model)
        self.z_proj = nn.Linear(z_dim, d_model)
        # CVAE encoder (style): ground-truth chunk + state -> (mu, logvar)
        self.style_encoder = nn.Sequential(
            nn.Linear(chunk_size * action_dim + state_dim, d_model),
            nn.ReLU(), nn.Linear(d_model, 2 * z_dim))
        # Action decoder: one learned query per chunk step
        self.queries = nn.Parameter(torch.randn(chunk_size, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.action_decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, action_dim)
        self.z_dim = z_dim

    def forward(self, images, joint_state, action_targets=None):
        cam_feats = [enc(im) for enc, im in zip(self.cam_encoders, images)]
        if action_targets is not None:  # training: infer z from the demo
            h = self.style_encoder(torch.cat(
                [action_targets.flatten(1), joint_state], dim=1))
            mu, logvar = h.chunk(2, dim=1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        else:  # inference: use the prior mean, z = 0
            mu = logvar = None
            z = joint_state.new_zeros(len(joint_state), self.z_dim)
        # One memory token per camera, plus state and latent tokens
        memory = torch.stack(
            cam_feats + [self.state_proj(joint_state), self.z_proj(z)], dim=1)
        chunk = self.action_decoder(
            tgt=self.queries.expand(len(memory), -1, -1), memory=memory)
        return self.action_head(chunk), (mu, logvar)  # (B, H, action_dim)
~150 lines of PyTorch in total. Train on a LeRobotDataset with the L1 imitation loss plus the KL term on the latent z (the cvae_loss sketched earlier).
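Wiring it together, a hedged training step using the `cvae_loss` from earlier. The batch keys and hyperparameters are illustrative, not LeRobot's actual feature names or defaults:

```python
model = ACT()
optim = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)

for batch in dataloader:  # any loader yielding images, state, action chunks
    images = [batch["cam_scene"], batch["cam_wrist"]]   # illustrative keys
    pred, (mu, logvar) = model(images, batch["state"], batch["actions"])
    loss = cvae_loss(pred, batch["actions"], mu, logvar)
    optim.zero_grad()
    loss.backward()
    optim.step()
```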
Empirical numbers
- Original ALOHA (2023): 90% on threading a zip tie, 97% on opening a cup.
- Mobile ALOHA (2024): similar success on tasks requiring base motion.
- OpenVLA (2024): trained on Open X-Embodiment; it emits autoregressive discrete action tokens rather than ACT-style chunks, though later fine-tuning recipes re-introduced chunked decoding.
- π0 (2024): keeps action chunking but generates the chunk with flow matching instead of a CVAE decoder; incremental improvement in smoothness.
ACT vs Diffusion Policy: when to pick which
| | ACT | Diffusion Policy |
|---|---|---|
| Inference speed | Single forward pass (~10 ms) | 10–50 steps (~50 ms) |
| Multimodal demos | Good (CVAE) | Better (full distribution) |
| Implementation simplicity | Standard transformer | UNet + scheduler |
| Best for | Real-time control, modest data variance | Highly multimodal demos, rich data |
For most teams: start with ACT. Switch to diffusion if you observe action averaging on multimodal data.
Production gotchas
- Chunk length: too short and the compounding-error problem returns; too long and chunks go stale. 16 timesteps at 50 Hz (0.32 s) is a common default, though the original ACT paper used much longer chunks (k = 100) together with temporal ensembling.
- Camera count: more cameras → more compute, often better policies. Two (scene + wrist) is the practical minimum.
- Joint vs end-effector action: end-effector is more general; joint is sometimes more precise. Choose based on task.
- Action normalization: per-channel z-score is standard (a sketch follows this list). Forget it and the loss is dominated by large-magnitude joints.
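A minimal sketch of the per-channel z-score, fit on the whole training set; the stats must be saved with the checkpoint and re-applied at inference:

```python
import numpy as np

def fit_action_stats(actions):
    """actions: (N, action_dim) array of every action in the dataset."""
    mean = actions.mean(axis=0)
    std = actions.std(axis=0) + 1e-8  # guard against constant channels
    return mean, std

def normalize(a, mean, std):          # apply to targets before training
    return (a - mean) / std

def denormalize(a, mean, std):        # apply to network outputs at inference
    return a * std + mean
```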
Exercise
In LeRobot, train an ACT policy on the SO-100 PushT example. Compare success rate with chunk size 8, 16, 32. The 16-step default is usually best; longer chunks plateau and shorter ones revert to the compounding-error regime.
Next
RL primer for roboticists — the other half of robot learning, distilled to what a manipulation engineer needs.