Production VLA fine-tuning playbook
OpenVLA, π0, GR00T — collect 200 demonstrations, LoRA-fine-tune, deploy. The end-to-end pipeline that goes from teleop rig to working policy on your specific arm in a weekend.
VLAs are no longer a closed-source vendor demo. In 2026 a hobbyist with a $500 SO-100 arm, a 24 GB GPU, and a weekend can fine-tune OpenVLA or π0 on their own task and deploy. Here's the production playbook — what to collect, how to train, and the diagnostic checks that catch failures before they hit hardware.
The pipeline at a glance
- Set up a teleop rig — phone, GELLO clone, or an ALOHA puppet.
- Collect 100–500 demonstrations of one specific task.
- Convert to the LeRobot dataset format.
- LoRA-fine-tune a base VLA (OpenVLA, π0, SmolVLA).
- Evaluate in sim on held-out scenarios.
- Deploy on the real robot. Iterate.
Each step has its own gotchas. Below, in order.
1. Pick the right base model
| Model | Params | Best for | License |
|---|---|---|---|
| OpenVLA | 7B | Single-arm tasks, the most-tested choice in 2026 | MIT |
| π0 (OpenPI) | 3B | Smoother actions (flow-matching), bimanual tasks | Apache-2.0 |
| SmolVLA | 450M | Edge deployment, 12GB GPU sufficient | Apache-2.0 |
| GR00T N1.5 | 1.5B | Humanoid whole-body | Apache-2.0 |
Default for first fine-tune: OpenVLA. The most documented, most-validated model with the biggest community of fine-tunes to learn from. Move to π0 if your task needs smoother trajectories or bimanual coordination.
2. Demonstrations: quality over quantity
Demonstration quality is the single most important determinant of fine-tune quality. Good rules:
- 100 demos minimum, 200–300 is the sweet spot for a single-task fine-tune.
- Vary the scene. Different lighting, different starting positions of objects, different distractors in the background. Each demo should look slightly different.
- One task per dataset. Don't mix "pick up the cup" and "stack blocks" in the same fine-tune; performance suffers in both.
- Include recoveries. Drop the object on purpose, then pick it back up. Your robot will fail in the field; the demo set should show it how to recover.
- Use natural-language labels. Tag each episode with its instruction in plain English ("pick up the blue cup"); the VLA conditions on it.
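The "vary the scene" rule is easy to check before you train. A minimal sketch, assuming you logged each episode's initial object position (x, y in metres) as metadata — the function names and the 10 cm threshold are illustrative, not part of any library:

```python
def placement_spread(initial_positions):
    """Per-axis range (metres) of initial object positions across episodes."""
    xs = [p[0] for p in initial_positions]
    ys = [p[1] for p in initial_positions]
    return max(xs) - min(xs), max(ys) - min(ys)

def diversity_warning(initial_positions, min_range_m=0.10):
    """True if starting poses span less than ~10 cm on either axis."""
    dx, dy = placement_spread(initial_positions)
    return dx < min_range_m or dy < min_range_m
```

If this fires on your dataset, collect more demos with the object moved around before spending GPU hours.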
3. Teleop rig options
- Phone teleop — point the camera; drag gestures map to gripper position. Simplest, lowest fidelity. Good for proof-of-concept on simple arms.
- GELLO (open-source) — a small, joint-encoded puppet of your robot's arm. You move the puppet; the real arm mirrors it. Cheap (~$200) and fast (50 demos/hour is realistic).
- ALOHA — same idea but bimanual. Required for any bimanual fine-tune.
- VR controllers — Quest, Vive. Higher friction to set up; less precise than puppets.
Recommendation: build or buy a GELLO clone for your first fine-tune. The collection rate compounds: 200 demos in 4 hours, training in 6, evaluation in 1. That fits in a weekend.
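The core of a GELLO-style rig is one mapping: puppet encoder readings go through a per-joint sign and offset calibration to become follower joint targets. A hedged sketch — the driver I/O (reading encoders, sending commands) is omitted, and the function name and calibration values are hypothetical:

```python
def mirror_joints(puppet_rad, signs, offsets):
    """Map puppet encoder angles (radians) to follower joint targets.

    signs flips joints whose encoders are mounted mirrored;
    offsets aligns each joint's zero. Both come from a one-time calibration.
    """
    return [s * q + o for q, s, o in zip(puppet_rad, signs, offsets)]
```

The real loop just calls this at your control rate (30–50 Hz) while the dataset recorder logs observations and the commanded actions.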
4. Dataset format
LeRobot is the de-facto standard:
```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset.create(
    repo_id="my-username/pick-up-cup",
    fps=30,
    robot_type="so100",
    features={
        "observation.images.scene": {"dtype": "video", "shape": (480, 640, 3)},
        "observation.images.wrist": {"dtype": "video", "shape": (240, 320, 3)},
        "observation.state": {"dtype": "float32", "shape": (6,)},
        "action": {"dtype": "float32", "shape": (6,)},
    },
)

# In the teleop loop:
dataset.add_frame({"observation.images.scene": frame, ...})
dataset.save_episode(task="pick up the blue cup")
```
Two cameras (one scene, one wrist) and proprioceptive state are the standard observation set. Push to Hugging Face for portability and reuse.
5. LoRA fine-tuning
Full fine-tuning of a 7B model needs 80+ GB of VRAM. LoRA (Low-Rank Adaptation) trains only ~1% of the parameters and fits in 24 GB. For single-task fine-tunes, quality is nearly identical.
```python
from lerobot.scripts.train import train

# Illustrative invocation; check your lerobot version's training entry
# point and argument names, which change between releases.
train(
    policy_class="openvla",
    base_checkpoint="openvla/openvla-7b",
    dataset_repo_id="my-username/pick-up-cup",
    use_lora=True,
    lora_rank=32,
    lora_target_modules=["q_proj", "v_proj"],
    train_batch_size=4,
    learning_rate=5e-4,
    num_steps=20_000,
    save_every=2_000,
    output_dir="./checkpoints/pick-up-cup",
)
```
Numbers that work in practice for OpenVLA:
- LoRA rank: 16–32. Higher = more capacity, more VRAM.
- Learning rate: 5e-4 for LoRA (vs 1e-5 for full fine-tune).
- Batch size: as big as fits. 4 on a 24 GB card.
- Steps: 20–50k for 200 demos. Watch eval loss; stop when it plateaus.
- Time: ~6 hours on an RTX 4090.
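Where does the "~1% of parameters" figure come from? Each adapted weight matrix W (d_out × d_in) gains two low-rank factors: A (rank × d_in) and B (d_out × rank). A back-of-envelope count, assuming a 7B-class backbone with 32 layers and hidden size 4096 and adapting q_proj and v_proj only — illustrative dimensions, not OpenVLA's exact config:

```python
def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)

layers, hidden, rank = 32, 4096, 32
per_layer = 2 * lora_params(hidden, hidden, rank)  # q_proj and v_proj
total = layers * per_layer                         # trainable LoRA parameters
fraction = total / 7e9                             # share of a 7B base
```

At rank 32 this lands well under 1% of the base model, which is why only the adapters and optimizer state (not full gradients) compete for your 24 GB.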
6. Evaluation discipline
Two evaluation regimes, both required:
Sim eval
Drop your fine-tuned model into a held-out sim setup (LIBERO is the standard benchmark). Measure success rate over 50–100 trials with random initial conditions. Look for a success rate above 85%, smooth actions (no high-frequency jitter), and robustness to distractor objects.
Real-world eval
Don't deploy until sim looks good. Then on hardware: 30 trials, log the robot video for each. Categorize failures: never approached, dropped object, hit obstacle, timed out. Each failure mode points at a different fix (more demos, more domain randomization, better collision handling).
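With only 30 trials, the point estimate is noisy: report a confidence interval alongside the success rate. A standard choice is the Wilson score interval for a binomial proportion (the z=1.96 default gives ~95% coverage):

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    )
    return centre - half, centre + half

lo, hi = wilson_interval(24, 30)  # 24/30 observed: roughly 63%-90% true rate
```

A 24/30 run is consistent with anything from ~63% to ~90%, which is why "85% in sim, then confirm on hardware" beats chasing single-digit differences between checkpoints.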
The five most common failure modes
- Open-loop drift — the model executes 5 seconds of correct motion, then meanders. Fix: more demos with longer time horizons; use action chunking with a chunk size of 16–32.
- Misses the object — model approaches consistently 1–2 cm off. Fix: collect demos with deliberate variation in object placement; add wrist-camera view if you only had scene-camera.
- Sim-to-real gap on lighting — works in your studio, fails in a friend's apartment. Fix: collect demos under varied lighting (window light, overhead light, mixed). Color-jitter augmentation in training.
- Action saturation — model commands joint angles outside the robot's range, the safety stop trips. Fix: clamp predictions during inference; verify your action normalization matches the training data's normalization.
- Catastrophic forgetting — fine-tuned model forgot tasks the base model could do. Fix: smaller LoRA rank, or use rehearsal data from the base mix.
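The action-saturation fix from the list above is cheap insurance worth hard-coding into every deployment. A minimal sketch — the per-joint mean/std must be the exact stats used to normalize actions at training time, and low/high are your robot's joint limits (all placeholders here):

```python
def denorm_and_clamp(action_norm, mean, std, low, high):
    """Undo training-time normalization, then clamp to joint limits."""
    out = []
    for a, m, s, lo, hi in zip(action_norm, mean, std, low, high):
        q = a * s + m                    # must match the training stats exactly
        out.append(min(max(q, lo), hi))  # never command outside the limits
    return out
```

If clamping triggers often in logs, that is itself a diagnostic: either the normalization stats are wrong or the model is genuinely predicting out-of-range motions.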
Inference latency
OpenVLA inference is ~150 ms per frame on a 24 GB GPU. Two tactics for real-time use:
- Action chunking: predict 16 actions, execute them open-loop for 16 frames, then predict again. Effective rate: 6–10 Hz, plenty for most tasks.
- SmolVLA on edge: 450M model runs at 30+ Hz on a Jetson Orin. Quality drop is real but usable for simpler tasks.
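The chunking tactic fits in one control loop: query the slow policy once, replay its chunk open-loop, re-plan when the chunk runs out. A sketch where `policy` and `env` are stand-ins for your inference wrapper and robot interface:

```python
def run_chunked(policy, env, steps, chunk_size=16):
    """Receding-horizon execution: re-plan every chunk_size control steps."""
    chunk, i = [], 0
    for _ in range(steps):
        if i == len(chunk):                          # chunk exhausted: re-plan
            chunk = list(policy(env.observe()))[:chunk_size]
            i = 0
        env.act(chunk[i])                            # replay the chunk open-loop
        i += 1
```

With a 150 ms prediction amortized over 16 steps, the per-step inference cost drops to under 10 ms, which is how a 30 Hz control rate coexists with a 7B model.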
Honest expectations
Even with everything right, expect:
- ~85% success on tasks similar to demos.
- ~60% success on tasks slightly outside the demo distribution (different lighting, novel object).
- ~10% success on tasks substantially different. The model won't tell you it's outside its distribution; it'll just fail confidently.
This is materially better than what was possible in 2024. It is not yet "deploy and forget." It is "iterate on demos for two weekends and have a working capability."
Where to learn more
- OpenVLA repo — fine-tuning script + LoRA recipes
- π0 OpenPI — open release of π0 from Physical Intelligence
- LeRobot — dataset format, training, deployment glue
Exercise
Build (or buy) a GELLO arm. Pick a single, narrow task — "pick up the orange block and put it in the box." Collect 200 demos in one afternoon. Fine-tune OpenVLA overnight. Run 30 evals on hardware the next morning. Compute success rate. You now know what 2026's frontier feels like — and the diff between "demo" and "production" is mostly more of the above.
Next
Embodied reasoning with LLM agents — what to do when even a fine-tuned VLA can't handle the long-horizon, multi-step plan you need.