Production VLA fine-tuning playbook
OpenVLA, π0, GR00T — collect 200 demonstrations, LoRA-fine-tune, deploy. The end-to-end pipeline that goes from teleop rig to working policy on your specific arm in a weekend.
VLAs are no longer a closed-source vendor demo. In 2026 a hobbyist with a $500 SO-100 arm, a 24 GB GPU, and a weekend can fine-tune OpenVLA or π0 on their own task and deploy. Here's the production playbook — what to collect, how to train, and the diagnostic checks that catch failures before they hit hardware.
The pipeline at a glance
- Set up a teleop rig — phone, GELLO clone, or an ALOHA puppet.
- Collect 100–500 demonstrations of one specific task.
- Convert to the LeRobot dataset format.
- LoRA-fine-tune a base VLA (OpenVLA, π0, SmolVLA).
- Evaluate in sim on held-out scenarios.
- Deploy on the real robot. Iterate.
Each step has its own gotchas. Below, in order.
1. Pick the right base model
| Model | Params | Best for | License |
|---|---|---|---|
| OpenVLA | 7B | Single-arm tasks, the most-tested choice in 2026 | MIT |
| π0 (OpenPI) | 3B | Smoother actions (flow-matching), bimanual tasks | Apache-2.0 |
| SmolVLA | 450M | Edge deployment, 12GB GPU sufficient | Apache-2.0 |
| GR00T N1.5 | 1.5B | Humanoid whole-body | Apache-2.0 |
Default for first fine-tune: OpenVLA. The most documented, most-validated model with the biggest community of fine-tunes to learn from. Move to π0 if your task needs smoother trajectories or bimanual coordination.
2. Demonstrations: quality over quantity
Demonstration quality is the single most important determinant of fine-tune quality. Good rules:
- 100 demos minimum, 200–300 is the sweet spot for a single-task fine-tune.
- Vary the scene. Different lighting, different starting positions of objects, different distractors in the background. Each demo should look slightly different.
- One task per dataset. Don't mix "pick up the cup" and "stack blocks" in the same fine-tune; performance suffers in both.
- Include recoveries. Drop the object on purpose, then pick it back up. Your robot will fail in the field; the demo set should show it how to recover.
- Use natural-language labels. Tag each episode with its instruction in plain English ("pick up the blue cup"); the VLA conditions on it.
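The "vary the scene" rule is easy to check before you train. A minimal sketch, assuming you logged each episode's initial object position (x, y in metres) as metadata — the function names and the 10 cm threshold are illustrative, not part of any library:

```python
def placement_spread(initial_positions):
    """Per-axis range (metres) of initial object positions across episodes."""
    xs = [p[0] for p in initial_positions]
    ys = [p[1] for p in initial_positions]
    return max(xs) - min(xs), max(ys) - min(ys)

def diversity_warning(initial_positions, min_range_m=0.10):
    """True if starting poses span less than ~10 cm on either axis."""
    dx, dy = placement_spread(initial_positions)
    return dx < min_range_m or dy < min_range_m
```

If this fires on your dataset, collect more demos with the object moved around before spending GPU hours.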
3. Teleop rig options
- Phone teleop — point the camera; drag gestures map to gripper position. Simplest, lowest fidelity. Good for proof-of-concept on simple arms.
- GELLO (open-source) — a small, joint-encoded puppet of your robot's arm. You move the puppet; the real arm mirrors it. Cheap (~$200) and fast (50 demos/hour is realistic).
- ALOHA — same idea but bimanual. Required for any bimanual fine-tune.
- VR controllers — Quest, Vive. Higher friction to set up; less precise than puppets.
Recommendation: build or buy a GELLO clone for your first fine-tune. The collection rate compounds: 200 demos in 4 hours, training in 6, evaluation in 1. That fits in a weekend.
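The core of a GELLO-style rig is one mapping: puppet encoder readings go through a per-joint sign and offset calibration to become follower joint targets. A hedged sketch — the driver I/O (reading encoders, sending commands) is omitted, and the function name and calibration values are hypothetical:

```python
def mirror_joints(puppet_rad, signs, offsets):
    """Map puppet encoder angles (radians) to follower joint targets.

    signs flips joints whose encoders are mounted mirrored;
    offsets aligns each joint's zero. Both come from a one-time calibration.
    """
    return [s * q + o for q, s, o in zip(puppet_rad, signs, offsets)]
```

The real loop just calls this at your control rate (30–50 Hz) while the dataset recorder logs observations and the commanded actions.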
4. Dataset format
LeRobot is the de-facto standard:
```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset.create(
    repo_id="my-username/pick-up-cup",
    fps=30,
    robot_type="so100",
    features={
        "observation.images.scene": {"dtype": "video", "shape": (480, 640, 3)},
        "observation.images.wrist": {"dtype": "video", "shape": (240, 320, 3)},
        "observation.state": {"dtype": "float32", "shape": (6,)},
        "action": {"dtype": "float32", "shape": (6,)},
    },
)

# In the teleop loop:
dataset.add_frame({"observation.images.scene": frame, ...})
dataset.save_episode(task="pick up the blue cup")
```
Two cameras (one scene, one wrist) and proprioceptive state are the standard observation set. Push to Hugging Face for portability and reuse.
5. LoRA fine-tuning
Full fine-tuning of a 7B model needs 80+ GB of VRAM. LoRA (Low-Rank Adaptation) trains only ~1% of the parameters and fits in 24 GB. For single-task fine-tunes, quality is nearly identical.
```python
from lerobot.scripts.train import train

# Illustrative invocation; check your lerobot version's training entry
# point and argument names, which change between releases.
train(
    policy_class="openvla",
    base_checkpoint="openvla/openvla-7b",
    dataset_repo_id="my-username/pick-up-cup",
    use_lora=True,
    lora_rank=32,
    lora_target_modules=["q_proj", "v_proj"],
    train_batch_size=4,
    learning_rate=5e-4,
    num_steps=20_000,
    save_every=2_000,
    output_dir="./checkpoints/pick-up-cup",
)
```
Numbers that work in practice for OpenVLA:
- LoRA rank: 16–32. Higher = more capacity, more VRAM.
- Learning rate: 5e-4 for LoRA (vs 1e-5 for full fine-tune).
- Batch size: as big as fits. 4 on a 24 GB card.
- Steps: 20–50k for 200 demos. Watch eval loss; stop when it plateaus.
- Time: ~6 hours on an RTX 4090.
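Where does the "~1% of parameters" figure come from? Each adapted weight matrix W (d_out × d_in) gains two low-rank factors: A (rank × d_in) and B (d_out × rank). A back-of-envelope count, assuming a 7B-class backbone with 32 layers and hidden size 4096 and adapting q_proj and v_proj only — illustrative dimensions, not OpenVLA's exact config:

```python
def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)

layers, hidden, rank = 32, 4096, 32
per_layer = 2 * lora_params(hidden, hidden, rank)  # q_proj and v_proj
total = layers * per_layer                         # trainable LoRA parameters
fraction = total / 7e9                             # share of a 7B base
```

At rank 32 this lands well under 1% of the base model, which is why only the adapters and optimizer state (not full gradients) compete for your 24 GB.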
6. Evaluation discipline
Two evaluation regimes, both required:
Sim eval
Drop your fine-tuned model into a held-out sim setup (LIBERO is the standard benchmark). Measure success rate over 50–100 trials with random initial conditions. Look for a success rate above 85%, smooth actions (no high-frequency jitter), and robustness to distractor objects.
Real-world eval
Don't deploy until sim looks good. Then on hardware: 30 trials, log the robot video for each. Categorize failures: never approached, dropped object, hit obstacle, timed out. Each failure mode points at a different fix (more demos, more domain randomization, better collision handling).
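With only 30 trials, the point estimate is noisy: report a confidence interval alongside the success rate. A standard choice is the Wilson score interval for a binomial proportion (the z=1.96 default gives ~95% coverage):

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    )
    return centre - half, centre + half

lo, hi = wilson_interval(24, 30)  # 24/30 observed: roughly 63%-90% true rate
```

A 24/30 run is consistent with anything from ~63% to ~90%, which is why "85% in sim, then confirm on hardware" beats chasing single-digit differences between checkpoints.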
The five most common failure modes
- Open-loop drift — the model executes 5 seconds of correct motion, then meanders. Fix: more demos with longer time horizons; use action chunking with a chunk size of 16–32.
- Misses the object — model approaches consistently 1–2 cm off. Fix: collect demos with deliberate variation in object placement; add wrist-camera view if you only had scene-camera.
- Sim-to-real gap on lighting — works in your studio, fails in a friend's apartment. Fix: collect demos under varied lighting (window light, overhead light, mixed). Color-jitter augmentation in training.
- Action saturation — model commands joint angles outside the robot's range, the safety stop trips. Fix: clamp predictions during inference; verify your action normalization matches the training data's normalization.
- Catastrophic forgetting — fine-tuned model forgot tasks the base model could do. Fix: smaller LoRA rank, or use rehearsal data from the base mix.
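The action-saturation fix from the list above is cheap insurance worth hard-coding into every deployment. A minimal sketch — the per-joint mean/std must be the exact stats used to normalize actions at training time, and low/high are your robot's joint limits (all placeholders here):

```python
def denorm_and_clamp(action_norm, mean, std, low, high):
    """Undo training-time normalization, then clamp to joint limits."""
    out = []
    for a, m, s, lo, hi in zip(action_norm, mean, std, low, high):
        q = a * s + m                    # must match the training stats exactly
        out.append(min(max(q, lo), hi))  # never command outside the limits
    return out
```

If clamping triggers often in logs, that is itself a diagnostic: either the normalization stats are wrong or the model is genuinely predicting out-of-range motions.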
Inference latency
OpenVLA inference is ~150 ms per frame on a 24 GB GPU. Two tactics for real-time use:
- Action chunking: predict 16 actions, execute them open-loop for 16 frames, then predict again. Effective rate: 6–10 Hz, plenty for most tasks.
- SmolVLA on edge: 450M model runs at 30+ Hz on a Jetson Orin. Quality drop is real but usable for simpler tasks.
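The chunking tactic fits in one control loop: query the slow policy once, replay its chunk open-loop, re-plan when the chunk runs out. A sketch where `policy` and `env` are stand-ins for your inference wrapper and robot interface:

```python
def run_chunked(policy, env, steps, chunk_size=16):
    """Receding-horizon execution: re-plan every chunk_size control steps."""
    chunk, i = [], 0
    for _ in range(steps):
        if i == len(chunk):                          # chunk exhausted: re-plan
            chunk = list(policy(env.observe()))[:chunk_size]
            i = 0
        env.act(chunk[i])                            # replay the chunk open-loop
        i += 1
```

With a 150 ms prediction amortized over 16 steps, the per-step inference cost drops to under 10 ms, which is how a 30 Hz control rate coexists with a 7B model.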
Honest expectations
Even with everything right, expect:
- ~85% success on tasks similar to demos.
- ~60% success on tasks slightly outside the demo distribution (different lighting, novel object).
- ~10% success on tasks substantially different. The model won't tell you it's outside its distribution; it'll just fail confidently.
This is materially better than what was possible in 2024. It is not yet "deploy and forget." It is "iterate on demos for two weekends and have a working capability."
Where to learn more
- OpenVLA repo — fine-tuning script + LoRA recipes
- π0 OpenPI — open release of π0 from Physical Intelligence
- LeRobot — dataset format, training, deployment glue
Exercise
Build (or buy) a GELLO arm. Pick a single, narrow task — "pick up the orange block and put it in the box." Collect 200 demos in one afternoon. Fine-tune OpenVLA overnight. Run 30 evals on hardware the next morning. Compute success rate. You now know what 2026's frontier feels like — and the diff between "demo" and "production" is mostly more of the above.
Next
Embodied reasoning with LLM agents — what to do when even a fine-tuned VLA can't handle the long-horizon, multi-step plan you need.